The hardware budget
How arctic sizes its work to the machine, what arctic info reports, and the choice between the Go and DuckDB conversion engines.
Converting the bulk archive is bounded by CPU, memory, and disk. A small laptop and a large server should not run the same number of parallel conversions, so arctic detects the machine it is on and derives a work budget from it rather than assuming a fixed level of parallelism.
See the budget
arctic info
That reports the detected hardware and the budget arctic computed from it. The hardware fields:
| Field | Meaning |
|---|---|
os, hostname |
The host arctic detected |
cpus |
Logical CPU count |
ram_total_gb |
Total system memory |
ram_available_gb |
Memory available right now |
disk_free_gb |
Free space on the work directory's disk |
The budget fields derived from them:
| Field | Meaning |
|---|---|
max_downloads |
How many dumps to download at once |
max_process |
How many files to convert at once |
max_convert_workers |
Worker threads inside one conversion |
duckdb_memory_mb |
Memory ceiling handed to the DuckDB engine |
sequential |
Whether the host is small enough to run strictly one step at a time |
And the storage and engine fields:
| Field | Meaning |
|---|---|
engine |
The active conversion engine (go or duckdb) |
duckdb_available |
Whether this binary was built with the DuckDB engine |
data_dir, raw_dir, work_dir, index_path |
The resolved storage paths |
The sequential fallback
On a small host, running downloads and conversions in parallel competes for the
same scarce memory and disk and ends up slower than doing one thing at a time. So
the budget carries a sequential flag: when the detected memory or core count is
low, arctic runs strictly one step after another. You do not set this; it follows
from the hardware. A larger host runs the downloads and conversions concurrently
up to the max_* caps.
Overriding the caps
--workers overrides the convert and download caps when you want to set the
parallelism by hand, for example to leave headroom for other work or to push a
capable machine harder:
arctic pull 2024-01..2024-03 --process --workers 4
arctic info --workers 4 # see the budget with the override applied
On the API path, --workers sets the fetch concurrency the same way. With
--workers 0 (the default) arctic uses the computed budget.
The two conversion engines
arctic ships two engines for the JSONL-to-Parquet step.
The Go engine (--engine go, the default) is pure Go and built into every
binary. It needs nothing installed alongside arctic and runs everywhere the
binary does. This is the right default and the only engine in the standard
release build.
The DuckDB engine (--engine duckdb) does the same conversion through
DuckDB's read_json and COPY ... TO. It is faster on a capable host and
respects the duckdb_memory_mb budget so it does not exhaust memory on a large
file. It needs a binary built with -tags duckdb and a cgo toolchain:
make build-duckdb
./bin/arctic pull 2024-01..2024-03 --process --engine duckdb
A pure-Go binary asked for --engine duckdb exits with a usage error telling you
to build with -tags duckdb, and arctic info shows duckdb_available: false,
so you always know which build you are running. Both engines write the same fixed
schema, so the choice is purely about speed and toolchain, and shards from one are
indistinguishable from the other at query time.
Disk reality
The dumps are large: hundreds of gigabytes uncompressed for a wide range, and the
.zst files plus the Parquet shards both live under the data directory while a
run is in flight. arctic info reports disk_free_gb so you can check before a
big pull, and arctic catalog --sizes reports the per-file download sizes. A
publish --commit run clears its local Parquet after each successful commit by
default to keep the footprint down; pass --keep to hold onto it. See
configuration for splitting the raw and work trees
onto different disks with --raw-dir and --work-dir.