v0.1.0
The first public release of arctic: the full command surface, the two acquisition paths, the Parquet conversion, the local index, and the arctic and shift libraries.
arctic is a single binary that turns the bulk Reddit archive into queryable Parquet. It pulls from the monthly torrent dumps and the Arctic Shift backfill API, decompresses the zstd JSONL, writes Parquet with a fixed schema, keeps a local index of what you hold, and can publish the shards to a Hugging Face dataset. The default build is pure Go with no runtime dependencies.
What you get
- Pull the bulk dumps.
  arctic pull downloads the monthly RC_/RS_ files from the public torrent catalog, a month, a list, or a 2024-01..2024-03 range at a time, and with --process converts each completed file to Parquet as it lands. arctic catalog lists the months the catalog covers, with --sizes for the per-file download sizes.
- Acquire one community or account.
  arctic sub pulls a community torrent-first and falls back to the Arctic Shift API, so it works whether or not the community is in a bundle; --api forces the API path. arctic user pulls one account through the API. Both take --kind, --after, --before, and --no-import.
- Convert to Parquet.
  arctic process converts decompressed JSONL dumps into zstd-compressed Parquet shards with a fixed twelve-column schema for comments and fourteen for submissions, counting and skipping unparseable lines rather than aborting.
- Query what you hold.
  arctic query scans the Parquet shards of an entity with filters on --author, --contains, --min-score, --after, --before, and --kind.
- Summarize the index.
  arctic stats reports what you hold --by month, type, or subreddit without rescanning the Parquet, and sub info / user info report one entity's shards, rows, bytes, and date span.
- Publish to Hugging Face.
  arctic publish runs the bulk pipeline over a month range and uploads the shards, with a dry run by default, the token from HF_TOKEN, a stats.csv ledger, and a resumable commit on exit 75.
- Size the work to the machine.
  arctic info reports the detected hardware, the derived work budget, the active engine, and the storage paths.
The two acquisition paths
The bulk path downloads the monthly dumps from the public torrent catalog and is
what pull and publish run. The entity path serves one subreddit or account:
torrent-first for a community in the per-subreddit bundle, and the Arctic Shift
API otherwise. Both feed the same Parquet store with the same fixed schema, so a
query reads uniformly across everything you imported.
The conversion engines
The default conversion engine is pure Go and built into every binary. An optional
DuckDB engine does the same conversion through read_json and COPY ... TO;
select it with --engine duckdb on a binary built with -tags duckdb and a cgo
toolchain. Both write the same schema, so the choice is about speed and
toolchain.
The libraries
The acquisition, conversion, indexing, and publishing live in the arctic
package, and the Arctic Shift client in shift, so you can build the archive from
your own program without the CLI:
import "github.com/tamnd/arctic-cli/arctic"

// Download one month of comments, then convert the downloaded file to Parquet.
cfg := arctic.DefaultConfig()
m, _ := arctic.ParseMonth("2024-01")
path, err := arctic.DownloadMonth(ctx, cfg, m, arctic.TypeComments, nil)
// Check err here before passing path on.
res, err := arctic.ProcessFile(ctx, cfg, path, arctic.TypeComments, "out", nil)
Independent and public-data only
arctic is an independent, open-source tool. It is not affiliated with, endorsed by, or sponsored by Reddit, Inc. It moves only public archive data: the monthly dumps seeded on Academic Torrents and the records served by the Arctic Shift backfill API.
Install
go install github.com/tamnd/arctic-cli/cmd/arctic@latest
Prebuilt archives for Linux, macOS, Windows, and FreeBSD, plus Linux packages (deb, rpm, apk), are on the release page. There is also a Homebrew cask and a Scoop entry:
brew install --cask tamnd/tap/arctic
The multi-arch container image is on GHCR:
docker run --rm ghcr.io/tamnd/arctic catalog
The default binary is pure Go (CGO_ENABLED=0) with no runtime dependencies.