The hardware budget

How arctic sizes its work to the machine, what arctic info reports, and the choice between the Go and DuckDB conversion engines.

Converting the bulk archive is bounded by CPU, memory, and disk. A small laptop and a large server should not run the same number of parallel conversions, so arctic detects the machine it is on and derives a work budget from it rather than assuming a fixed level of parallelism.

See the budget

arctic info

This prints the detected hardware and the budget arctic computed from it. The hardware fields:

Field	Meaning
`os`, `hostname`	The host arctic detected
`cpus`	Logical CPU count
`ram_total_gb`	Total system memory
`ram_available_gb`	Memory available right now
`disk_free_gb`	Free space on the work directory's disk

The budget fields derived from them:

Field	Meaning
`max_downloads`	How many dumps to download at once
`max_process`	How many files to convert at once
`max_convert_workers`	Worker threads inside one conversion
`duckdb_memory_mb`	Memory ceiling handed to the DuckDB engine
`sequential`	Whether the host is small enough to run strictly one step at a time

And the storage and engine fields:

Field	Meaning
`engine`	The active conversion engine (`go` or `duckdb`)
`duckdb_available`	Whether this binary was built with the DuckDB engine
`data_dir`, `raw_dir`, `work_dir`, `index_path`	The resolved storage paths

The sequential fallback

On a small host, parallel downloads and conversions compete for the same scarce memory and disk, and the run ends up slower than doing one thing at a time. So the budget carries a sequential flag: when the detected memory or core count is low, arctic runs strictly one step after another. You do not set this; it follows from the hardware. A larger host runs downloads and conversions concurrently, up to the max_* caps.
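As a rough illustration of the idea, the sketch below derives a budget from detected hardware. It is not arctic's actual code: the type names, the thresholds (2 cores, 4 GB), and the ratios are all assumptions chosen for the example. Only the field names follow what arctic info reports.

```go
package main

import "fmt"

// Hardware holds the two detected values that drive the sequential decision.
type Hardware struct {
	CPUs           int
	RAMAvailableGB float64
}

// Budget mirrors the budget fields listed by `arctic info`.
type Budget struct {
	MaxDownloads      int
	MaxProcess        int
	MaxConvertWorkers int
	DuckDBMemoryMB    int
	Sequential        bool
}

// deriveBudget is a hypothetical derivation. The thresholds and ratios
// here are assumptions for illustration, not the values arctic uses.
func deriveBudget(hw Hardware) Budget {
	halfRAMMB := int(hw.RAMAvailableGB * 1024 / 2)

	// Small host: one download, one conversion, one step at a time.
	if hw.CPUs <= 2 || hw.RAMAvailableGB < 4 {
		return Budget{
			MaxDownloads:      1,
			MaxProcess:        1,
			MaxConvertWorkers: 1,
			DuckDBMemoryMB:    halfRAMMB,
			Sequential:        true,
		}
	}

	// Larger host: split the cores between parallel conversions and
	// the worker threads inside each conversion.
	process := hw.CPUs / 4
	if process < 1 {
		process = 1
	}
	return Budget{
		MaxDownloads:      2,
		MaxProcess:        process,
		MaxConvertWorkers: hw.CPUs / process,
		DuckDBMemoryMB:    halfRAMMB / process,
		Sequential:        false,
	}
}

func main() {
	fmt.Printf("%+v\n", deriveBudget(Hardware{CPUs: 2, RAMAvailableGB: 8}))
	fmt.Printf("%+v\n", deriveBudget(Hardware{CPUs: 16, RAMAvailableGB: 32}))
}
```

The point of the shape is that every cap is a function of the detected hardware, so the same command behaves sensibly on a laptop and on a server.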

Overriding the caps

--workers overrides the convert and download caps when you want to set the parallelism by hand, for example to leave headroom for other work or to push a capable machine harder:

arctic pull 2024-01..2024-03 --process --workers 4
arctic info --workers 4          # see the budget with the override applied

On the API path, --workers sets the fetch concurrency the same way. With --workers 0 (the default) arctic uses the computed budget.
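The override rule can be sketched as a small function: zero means "use the computed budget", and any positive value replaces the caps. The Caps type and applyWorkers name are hypothetical, and which budget fields --workers maps to is an assumption here (the sketch overrides max_downloads and max_process). Treating negative values like zero is also an assumption.

```go
package main

import "fmt"

// Caps is a hypothetical subset of the budget that --workers overrides.
type Caps struct {
	MaxDownloads int
	MaxProcess   int
}

// applyWorkers returns the computed caps when workers is 0 (the default)
// and replaces both caps otherwise. Negative values are treated like 0,
// which is an assumption of this sketch.
func applyWorkers(computed Caps, workers int) Caps {
	if workers <= 0 {
		return computed
	}
	return Caps{MaxDownloads: workers, MaxProcess: workers}
}

func main() {
	computed := Caps{MaxDownloads: 2, MaxProcess: 6}
	fmt.Printf("%+v\n", applyWorkers(computed, 0)) // computed budget kept
	fmt.Printf("%+v\n", applyWorkers(computed, 4)) // both caps set to 4
}
```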

The two conversion engines

arctic ships two engines for the JSONL-to-Parquet step.

The Go engine (--engine go, the default) is pure Go and built into every binary. It needs nothing installed alongside arctic and runs everywhere the binary does. This is the right default and the only engine in the standard release build.

The DuckDB engine (--engine duckdb) does the same conversion through DuckDB's read_json and COPY ... TO. It is faster on a capable host and respects the duckdb_memory_mb budget so it does not exhaust memory on a large file. It requires a binary built with -tags duckdb, which in turn needs a cgo toolchain at build time:

make build-duckdb
./bin/arctic pull 2024-01..2024-03 --process --engine duckdb

A pure-Go binary asked for --engine duckdb exits with a usage error telling you to build with -tags duckdb, and arctic info shows duckdb_available: false, so you always know which build you are running. Both engines write the same fixed schema, so the choice is purely about speed and toolchain: at query time, shards written by one engine are indistinguishable from shards written by the other.
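In Go, a file that starts with a `//go:build duckdb` constraint is compiled only when the build passes -tags duckdb, which is how a binary can know at compile time whether an optional engine is present. The sketch below shows the selection logic such a flag could feed. The selectEngine function and the error wording are hypothetical, and the availability flag is passed as a parameter so the example stays in one file.

```go
package main

import (
	"errors"
	"fmt"
)

// selectEngine resolves the --engine value. duckdbAvailable stands in for
// a constant that a build-tagged file would set; in this sketch it is a
// plain parameter. The error text is illustrative, not arctic's message.
func selectEngine(name string, duckdbAvailable bool) (string, error) {
	switch name {
	case "", "go":
		// The Go engine is the default and is always compiled in.
		return "go", nil
	case "duckdb":
		if !duckdbAvailable {
			return "", errors.New("engine duckdb is not compiled in; rebuild with -tags duckdb")
		}
		return "duckdb", nil
	default:
		return "", fmt.Errorf("unknown engine %q", name)
	}
}

func main() {
	for _, name := range []string{"go", "duckdb"} {
		engine, err := selectEngine(name, false)
		fmt.Println(engine, err)
	}
}
```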

Disk reality

The dumps are large: a wide date range can reach hundreds of gigabytes uncompressed, and the .zst files and the Parquet shards both live under the data directory while a run is in flight. arctic info reports disk_free_gb so you can check before a big pull, and arctic catalog --sizes reports the per-file download sizes. A publish --commit run clears its local Parquet after each successful commit by default to keep the footprint down; pass --keep to hold onto it. See configuration for splitting the raw and work trees onto different disks with --raw-dir and --work-dir.
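A pre-flight check along these lines can be sketched as follows. It sums per-file download sizes and compares the total against free disk space. The fitsOnDisk function is hypothetical. The headroom multiplier is an assumption standing in for the extra room the Parquet shards need, and so is reading disk_free_gb as GiB.

```go
package main

import "fmt"

// fitsOnDisk reports whether the summed download sizes, scaled by a
// headroom factor, fit in the free space. The headroom factor and the
// GiB interpretation of diskFreeGB are assumptions of this sketch.
func fitsOnDisk(downloadBytes []int64, diskFreeGB, headroom float64) bool {
	var total int64
	for _, b := range downloadBytes {
		total += b
	}
	needGB := float64(total) / (1 << 30) * headroom
	return needGB <= diskFreeGB
}

func main() {
	// Illustrative sizes only: two files of 10 GiB and 20 GiB.
	sizes := []int64{10 << 30, 20 << 30}
	fmt.Println(fitsOnDisk(sizes, 100, 2.0))
	fmt.Println(fitsOnDisk(sizes, 50, 2.0))
}
```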