Publishing
Convert a month range and upload the Parquet shards to a Hugging Face dataset, with a dry run by default, a token from the environment, and a resumable commit.
publish runs the bulk pipeline over a month range and uploads the resulting
Parquet shards to a Hugging Face dataset. It is the
bulk pull plus an upload, wrapped so a long unattended run can rehearse, resume,
and leave a ledger of what it committed.
The token
The upload reads the token from the HF_TOKEN environment variable. Export a
write token from your Hugging Face account before a real commit:
export HF_TOKEN=hf_...
A --commit run with no HF_TOKEN set exits with a usage error rather than
starting work it cannot finish.
Dry run by default
Without --commit, publish does everything except the upload: it processes the
months into Parquet and reports what it would push. This is the way to rehearse a
run and confirm the range and the row counts before you spend the bandwidth:
arctic publish --from 2024-01 --to 2024-03
Add --commit to actually upload:
arctic publish --from 2024-01 --to 2024-03 --commit
The month range and types
--from and --to bound the range and default to the catalog start and end, the
same as pull. --type narrows to comments, submissions, or both:
arctic publish --from 2024-01 --to 2024-06 --type submissions --commit
The dataset repo
By default publish uploads to arctic's default dataset repo. Point it at your
own with --repo, and create it as private with --private:
arctic publish --from 2024-01 --to 2024-03 --repo your-name/reddit-archive --commit
arctic publish --from 2024-01 --to 2024-03 --repo your-name/reddit-archive --private --commit
The stats.csv ledger and resuming
A publish run records what it committed in a stats.csv ledger in the dataset.
That ledger is the record of which months have landed, so a re-run knows where to
pick up.
If a commit stalls, publish exits with code 75 rather than hanging or pretending
it finished. A supervisor (a shell loop, a systemd unit, a CI job) can treat code
75 as "restart me" and run the same command again; the next run reads the
stats.csv ledger and resumes from the last committed month instead of starting
over. A simple supervisor loop:
export HF_TOKEN=hf_...
until arctic publish --from 2024-01 --to 2024-12 --commit; do
[ $? -eq 75 ] || break # only retry on a stalled commit
echo "commit stalled, resuming"
done
Keeping the local Parquet
After a successful commit, publish clears the local Parquet it produced for the
run by default, since the canonical copy now lives in the dataset. Pass --keep
to leave the local shards in place, which is what you want if you also intend to
query them locally:
arctic publish --from 2024-01 --to 2024-03 --commit --keep
Sizing the run
A publish run carries the same conversion cost as a bulk pull, so it is bounded by
your hardware. arctic sizes the parallelism from the detected machine; override
the caps with --workers and pick the engine with --engine go or
--engine duckdb. See the hardware budget.
What the exit codes mean here
| Code | Meaning for publish |
|---|---|
0 |
The range processed and (with --commit) uploaded |
2 |
Usage error, including --commit without HF_TOKEN |
3 |
No data: no month in the range was published in the catalog |
5 |
Blocked: a source rate-limited or refused |
75 |
A commit stalled; restart to resume from the ledger |
See troubleshooting for the full table.