arctic

v0.1.0

The first public release of arctic: the full command surface, the two acquisition paths, the Parquet conversion, the local index, and the arctic and shift libraries.

arctic is a single binary that turns the bulk Reddit archive into queryable Parquet. It pulls from the monthly torrent dumps and the Arctic Shift backfill API, decompresses the zstd JSONL, writes Parquet with a fixed schema, keeps a local index of what you hold, and can publish the shards to a Hugging Face dataset. The default build is pure Go with no runtime dependencies.

What you get

  • Pull the bulk dumps. arctic pull downloads the monthly RC_/RS_ files from the public torrent catalog, taking one month, a list of months, or a range such as 2024-01..2024-03 at a time. With --process it converts each completed file to Parquet as it lands. arctic catalog lists the months the catalog covers, with --sizes for the per-file download sizes.
  • Acquire one community or account. arctic sub pulls a community torrent-first and falls back to the Arctic Shift API, so it works whether or not the community is in a bundle; --api forces the API path. arctic user pulls one account through the API. Both take --kind, --after, --before, and --no-import.
  • Convert to Parquet. arctic process converts decompressed JSONL dumps into zstd-compressed Parquet shards with a fixed twelve-column schema for comments and fourteen for submissions, counting and skipping unparseable lines rather than aborting.
  • Query what you hold. arctic query scans the Parquet shards of an entity with filters on --author, --contains, --min-score, --after, --before, and --kind.
  • Summarize the index. arctic stats reports what you hold, grouped with --by month, type, or subreddit, without rescanning the Parquet. sub info and user info report one entity's shards, rows, bytes, and date span.
  • Publish to Hugging Face. arctic publish runs the bulk pipeline over a month range and uploads the shards, with a dry run by default, the token from HF_TOKEN, a stats.csv ledger, and a resumable commit on exit 75.
  • Size the work to the machine. arctic info reports the detected hardware, the derived work budget, the active engine, and the storage paths.
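
The month arguments that arctic pull accepts (one month, a list, or a 2024-01..2024-03 range) can be illustrated with a small sketch. This is not arctic's parser: expandMonths is a hypothetical helper, and the comma as list separator is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// expandMonths is a hypothetical helper, not part of arctic. It turns a
// single month, a comma-separated list (assumed separator), or an
// inclusive "start..end" range into a list of "YYYY-MM" strings.
func expandMonths(spec string) ([]string, error) {
	const layout = "2006-01"
	if from, to, ok := strings.Cut(spec, ".."); ok {
		start, err := time.Parse(layout, from)
		if err != nil {
			return nil, fmt.Errorf("range start %q: %w", from, err)
		}
		end, err := time.Parse(layout, to)
		if err != nil {
			return nil, fmt.Errorf("range end %q: %w", to, err)
		}
		if end.Before(start) {
			return nil, fmt.Errorf("range %q ends before it starts", spec)
		}
		var months []string
		// Both bounds fall on the first of the month, so AddDate never
		// skips a month through day-of-month overflow.
		for m := start; !m.After(end); m = m.AddDate(0, 1, 0) {
			months = append(months, m.Format(layout))
		}
		return months, nil
	}
	var months []string
	for _, part := range strings.Split(spec, ",") {
		m, err := time.Parse(layout, strings.TrimSpace(part))
		if err != nil {
			return nil, fmt.Errorf("month %q: %w", part, err)
		}
		months = append(months, m.Format(layout))
	}
	return months, nil
}

func main() {
	months, err := expandMonths("2024-01..2024-03")
	if err != nil {
		panic(err)
	}
	fmt.Println(months)
}
```

Treating the range as inclusive on both ends matches how 2024-01..2024-03 reads, but the real command may validate or order months differently.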

The two acquisition paths

The bulk path downloads the monthly dumps from the public torrent catalog and is what pull and publish run. The entity path serves one subreddit or account: torrent-first for a community in the per-subreddit bundle, and the Arctic Shift API otherwise. Both feed the same Parquet store with the same fixed schema, so a query reads uniformly across everything you imported.

The conversion engines

The default conversion engine is pure Go and built into every binary. An optional DuckDB engine does the same conversion through read_json and COPY ... TO; select it with --engine duckdb on a binary built with -tags duckdb and a cgo toolchain. Both write the same schema, so the choice is about speed and toolchain.

The libraries

Acquisition, conversion, indexing, and publishing live in the arctic package, and the Arctic Shift client lives in shift, so you can build the archive from your own program without the CLI:

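import "github.com/tamnd/arctic-cli/arctic"

// ctx is a context.Context supplied by the caller.
cfg := arctic.DefaultConfig()
m, _ := arctic.ParseMonth("2024-01")
// Download the 2024-01 comments dump, then convert the downloaded file to Parquet.
path, err := arctic.DownloadMonth(ctx, cfg, m, arctic.TypeComments, nil)
// Check err before using path; the same applies to res below.
res, err := arctic.ProcessFile(ctx, cfg, path, arctic.TypeComments, "out", nil)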

Independent and public-data only

arctic is an independent, open-source tool. It is not affiliated with, endorsed by, or sponsored by Reddit, Inc. It moves only public archive data: the monthly dumps seeded on Academic Torrents and the records served by the Arctic Shift backfill API.

Install

go install github.com/tamnd/arctic-cli/cmd/arctic@latest

Prebuilt archives for Linux, macOS, Windows, and FreeBSD, plus Linux packages (deb, rpm, apk), are on the release page. There are also a Homebrew cask and a Scoop entry:

brew install --cask tamnd/tap/arctic

The multi-arch container image is on GHCR:

docker run --rm ghcr.io/tamnd/arctic catalog

The default binary is pure Go (CGO_ENABLED=0) with no runtime dependencies.