Quick start

From an empty data directory to a real community imported, queried, and summarized, in a handful of commands.

This walks through the core loop: see what the catalog covers, acquire one community, query the records back, and inspect what you hold. None of it needs an account or a token. The first run downloads and converts data, so allow some time and a few gigabytes of free disk space.

1. See what the catalog covers

arctic catalog

Each row is a month the bulk torrent catalog covers, with a flag showing whether that month ships in the single large bundle or as its own per-month torrent. Add --sizes to fetch the per-file byte sizes from the catalog before you commit to a download:

arctic catalog --sizes

2. Acquire one community

arctic sub golang

This pulls r/golang's full history. arctic first checks the per-subreddit torrent bundle: if golang is in it, arctic downloads that file; if not, it streams the records from the Arctic Shift API. Either way it converts the result to Parquet and records the import in the local index. To narrow the pull to one record type, a date window, or both:

arctic sub golang --kind submissions --after 2020-01-01
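
The bundle-first, API-second order described above can be pictured with a small sketch. This is not arctic's code: the function name acquire and its three callables (find_in_bundle, download_bundle_file, stream_from_api) are hypothetical stand-ins, and the bundle entry string is made up for the example.

```python
from typing import Callable, Iterable, Optional

Record = dict


def acquire(
    name: str,
    find_in_bundle: Callable[[str], Optional[str]],
    download_bundle_file: Callable[[str], Iterable[Record]],
    stream_from_api: Callable[[str], Iterable[Record]],
) -> tuple[str, list[Record]]:
    """Return (source, records), preferring the torrent bundle over the API.

    All three callables are hypothetical placeholders for the real steps.
    """
    entry = find_in_bundle(name)
    if entry is not None:
        # The community is in the per-subreddit bundle: fetch that one file.
        return "bundle", list(download_bundle_file(entry))
    # Not in the bundle: fall back to streaming records from the API.
    return "api", list(stream_from_api(name))


# Stub data standing in for the bundle listing and the two transports.
bundle_listing = {"golang": "entry-for-golang"}  # hypothetical entry name
source, records = acquire(
    "golang",
    bundle_listing.get,
    lambda entry: [{"id": "from-bundle"}],
    lambda name: [{"id": "from-api"}],
)
print(source)  # bundle
```

In both branches the caller gets the same shape back, which is what lets the later conversion and indexing steps ignore where the records came from.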

3. Query what you imported

arctic query golang --contains generics -n 20

query scans the Parquet shards of an entity you have already imported. Filter by author, score, and date, and search for a substring in the body, title, or selftext:

arctic query golang --author rob_pike --min-score 100
arctic query golang --kind submissions --after 2024-01-01 --contains "go 1.22"
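
To make the filter semantics concrete, here is a minimal sketch over in-memory records. It is an illustration, not how arctic scans Parquet. Three things are assumptions rather than documented behavior: that the filters combine with AND, that the substring match is case-insensitive, and that --after includes the given day. The field names (author, score, created_utc, body, title, selftext) follow the usual Reddit record layout, and the second author name is made up.

```python
from datetime import datetime, timezone

# Text fields searched by the substring filter, per the description above.
TEXT_FIELDS = ("body", "title", "selftext")


def matches(record, author=None, min_score=None, after=None, contains=None):
    """Return True if the record passes every filter that was supplied."""
    if author is not None and record.get("author") != author:
        return False
    if min_score is not None and (record.get("score") or 0) < min_score:
        return False
    if after is not None:
        # Assumption: the date is UTC midnight and the cutoff is inclusive.
        cutoff = datetime.strptime(after, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        if float(record.get("created_utc") or 0) < cutoff.timestamp():
            return False
    if contains is not None:
        # Assumption: case-insensitive match against any of the text fields.
        needle = contains.lower()
        if not any(needle in str(record.get(f) or "").lower() for f in TEXT_FIELDS):
            return False
    return True


rows = [
    {"author": "rob_pike", "score": 250, "created_utc": 1706745600,
     "title": "Go 1.22 is released", "selftext": ""},
    {"author": "gopher42", "score": 12, "created_utc": 1577836800,
     "body": "generics would help here"},
]

print([r["author"] for r in rows if matches(r, author="rob_pike", min_score=100)])
# ['rob_pike']
print([r["author"] for r in rows if matches(r, contains="generics")])
# ['gopher42']
```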

4. Inspect one entity

arctic sub info golang

That reports what is imported locally for r/golang: the shard count, row count, byte size, and date span for each record type, read straight from the Parquet footers. There is a matching arctic user info <name> for accounts.
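
Reading from the footer is cheap because of how a Parquet file is laid out: it ends with the file metadata, then a 4-byte little-endian metadata length, then the magic bytes PAR1. The sketch below only locates that footer using the standard library. It is not arctic's implementation, the function name footer_length is invented for the example, and decoding the metadata itself (row counts, column statistics) would need a Parquet library.

```python
import os
import struct

MAGIC = b"PAR1"  # trailing magic of an unencrypted-footer Parquet file


def footer_length(path: str) -> int:
    """Return the size in bytes of the file metadata stored in the footer.

    Only the last 8 bytes of the file are read: a 4-byte little-endian
    length followed by the 4-byte magic.
    """
    if os.path.getsize(path) < 12:  # leading magic + length + trailing magic
        raise ValueError("file too small to be Parquet")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 trailer")
    return struct.unpack("<I", tail[:4])[0]
```

The metadata block sits immediately before those last 8 bytes, so a reader can seek to it directly without touching any row data.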

5. See the budget and paths

arctic info

That prints the detected hardware, the work budget arctic derives from it (MaxDownloads, MaxProcess, MaxConvertWorkers, DuckDBMemoryMB), the active engine, and the resolved storage paths.

A bulk pull

For a span of time across all of Reddit rather than one community, pull a month range and convert it as it lands:

arctic pull 2024-01..2024-03 --process
arctic stats --by month

stats then summarizes the local index: rows per month here, or use --by type or --by subreddit for the other cuts.
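
A month range such as 2024-01..2024-03 can be read as a list of individual months. The sketch below expands one, assuming both endpoints are included. That inclusiveness is an assumption, not something this page states, and expand_months is a name invented for the example.

```python
def expand_months(spec: str) -> list[str]:
    """Expand 'YYYY-MM..YYYY-MM' into every month in the span.

    Assumption: both endpoints are inclusive.
    """
    start, sep, end = spec.partition("..")
    if not sep:
        raise ValueError("expected START..END")
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    # Count months from year 0 so that year boundaries need no special case.
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    if last < first:
        raise ValueError("range end is before range start")
    return [f"{i // 12:04d}-{i % 12 + 1:02d}" for i in range(first, last + 1)]


print(expand_months("2024-01..2024-03"))
# ['2024-01', '2024-02', '2024-03']
```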

Where to next

You have the core loop. From here:

Bulk pulls covers pull, month ranges, --process, and stats.
Subreddits and users goes deep on the sub and user paths and their info subcommands.
Querying covers every query filter and how the scan works.
Output formats covers the table, JSON, CSV, and template rendering.
Publishing covers publish and Hugging Face.
The hardware budget covers info and the engines.
The CLI reference lists every command and flag.