Configuration
The data directory, the raw and work trees, the index, the environment, the engine, and the global flags with their defaults.
arctic needs almost no configuration. There is no config file; every option is a flag or an environment variable, and the defaults are chosen so the common case needs neither. See everything arctic resolved with:
arctic info
It prints the detected hardware, the work budget, the active engine, and the resolved storage paths.
The data directory
arctic keeps its state under one tree. It defaults to $XDG_DATA_HOME/arctic,
which is ~/.local/share/arctic on Linux when XDG_DATA_HOME is unset. Point it
elsewhere with --data-dir or the ARCTIC_DATA_DIR environment variable:
export ARCTIC_DATA_DIR=/mnt/big/arctic
arctic sub golang
Under the data directory live the downloaded .zst files, the per-entity Parquet
imports, the scratch work directory, and the SQLite index.
The raw and work directories
Two of those subtrees can be split out to their own paths, which is useful when your downloads and your conversion scratch want different disks:
| Flag | Default | What lives there |
|---|---|---|
--raw-dir |
under the data dir | Downloaded .zst dump files |
--work-dir |
under the data dir | Conversion scratch and the Parquet shards |
arctic pull 2024-01..2024-03 --process \
--raw-dir /mnt/spinning/raw \
--work-dir /mnt/ssd/work
The work-dir disk is also where arctic info measures disk_free_gb, since
that is where the conversion happens.
The index
Each import is recorded in a small SQLite index under the data directory.
arctic stats reads it to summarize what you hold by month, type, or subreddit
without rescanning the Parquet. arctic info reports its path as index_path.
You do not configure it directly; it follows the data directory.
The conversion engine
--engine chooses how JSONL becomes Parquet:
| Value | Needs | Notes |
|---|---|---|
go |
nothing | Pure Go, the default, in every binary |
duckdb |
a -tags duckdb build with cgo |
Faster on a capable host; respects duckdb_memory_mb |
A pure-Go binary asked for --engine duckdb exits with a usage error. arctic info shows duckdb_available so you know which build you have. See
the hardware budget.
The API request identity
The Arctic Shift API path sends a default User-Agent. Override it with
--user-agent if you are identifying your own client, and set the per-request
timeout with --timeout:
arctic sub golang --api --user-agent "my-archiver/1.0" --timeout 90s
Sizing knobs
| Flag | Default | Meaning |
|---|---|---|
-j, --workers |
0 |
Concurrent workers; 0 uses the computed budget |
--chunk-lines |
500000 |
JSONL lines per Parquet shard |
--timeout |
60s |
Per-request timeout on the API path |
--workers 0 lets arctic size the parallelism from the detected machine. A
non-zero value overrides the convert and download caps (and the API fetch
concurrency). See the hardware budget.
Environment variables
Every variable is overridden by the matching flag when you pass one, so the order of precedence is flag, then environment, then the built-in default.
| Variable | Used for |
|---|---|
ARCTIC_DATA_DIR |
Root data directory (overrides the XDG default) |
ARCTIC_RAW_DIR |
Where downloaded .zst files land |
ARCTIC_WORK_DIR |
Conversion scratch and the Parquet shards |
ARCTIC_REPO_ROOT |
Local staging tree for a publish run |
ARCTIC_ENGINE |
Conversion engine, go or duckdb |
ARCTIC_CHUNK_LINES |
JSONL lines per Parquet shard |
ARCTIC_MIN_FREE_GB |
Free-disk floor a publish refuses to start below |
HF_TOKEN |
Hugging Face write token for publish --commit |
Global flags
| Flag | Default | Meaning |
|---|---|---|
-o, --output |
auto | table, json, jsonl, csv, tsv, url, raw |
--fields |
all | Comma-separated columns to include |
--no-header |
off | Omit the header row in table/csv/tsv |
--template |
none | Go text/template applied per record |
--color |
auto | auto, always, or never |
-n, --limit |
0 |
Maximum records; 0 is no limit |
-q, --quiet |
off | Suppress progress on stderr |
-j, --workers |
0 |
Concurrent workers; 0 uses the budget |
--data-dir |
XDG | Root data directory (env ARCTIC_DATA_DIR) |
--raw-dir |
under data dir | Where downloaded .zst files land |
--work-dir |
under data dir | Conversion scratch and shards |
--engine |
go |
Conversion engine: go or duckdb |
--chunk-lines |
500000 |
JSONL lines per Parquet shard |
--timeout |
60s |
Per-request timeout for the API path |
--user-agent |
arctic default | User-Agent on the API path |
Output auto-detection
With no -o, the default output format adapts to where it is going: an aligned
table when the output is a terminal, JSONL when it is piped. See
output formats.
Exit codes
| Code | Meaning |
|---|---|
0 |
OK |
1 |
Error |
2 |
Usage error |
3 |
No data (nothing matched, published, or imported) |
4 |
Partial (some targets in a batch failed) |
5 |
Blocked (the API rate-limited or refused) |
75 |
Commit stalled (restart a publish run to resume) |