Guides

Task-oriented walkthroughs for the things people actually do with the bulk Reddit archive.

Each guide is built around a job rather than a command: pulling a span of monthly bulk dumps, acquiring one subreddit or account, querying the imported records, shaping the output, publishing to Hugging Face, and reading the hardware budget. They assume you have run the quick start.

Bulk pulls
Pull a span of monthly bulk dumps from the torrent catalog, convert them to Parquet as they land, and summarize what you hold.

Subreddits and users
Acquire one community or one account: the torrent-first sub path, the API user path, the date and type filters, and the info subcommands.

Querying
Read imported records back out of the Parquet shards with filters on author, score, date, type, and a substring match.

Output formats
Render records as a table, JSON, JSONL, CSV, or TSV, narrow the columns, template each row, and script against the exit codes.

Publishing
Convert a month range and upload the Parquet shards to a Hugging Face dataset, with a dry run by default, a token from the environment, and a resumable commit.

The hardware budget
How arctic sizes its work to the machine, what arctic info reports, and the choice between the Go and DuckDB conversion engines.