Skip to content

EGAfetch

Fast, parallel, resumable download of data and metadata from the European Genome-phenome Archive (EGA).

EGAfetch is a GO-based command-line tool an alternative to pyEGA3 with significantly faster downloads, automatic resume, and robust error handling.


Why EGAfetch?

Downloading large genomic datasets from EGA with pyEGA3 is slow, fragile, and requires manual inspection. EGAfetch solves this:

  • Parallel downloads -- multiple files and multiple chunks per file downloaded simultaneously
  • Automatic resume -- interrupted downloads pick up exactly where they stopped
  • Checksum verification -- MD5/SHA256 verified after every file
  • Token auto-refresh -- OAuth2 tokens refreshed transparently before expiry
  • Retry with backoff -- exponential backoff with jitter on transient failures
  • Metadata export -- download dataset metadata as TSV, CSV, or JSON (auto-fetched during download)
  • Bandwidth throttling -- cap total bandwidth with --max-bandwidth for shared HPC networks
  • Config file -- persist defaults in ~/.egafetch/config.yaml
  • File filtering -- selectively download with --include/--exclude glob patterns
  • Adaptive chunk sizing -- auto-tune chunk size based on throughput with --adaptive-chunks
  • Batch file input -- pass a text file with identifiers (one per line, # comments supported)
  • MD5 sidecar files -- .md5 checksum file written alongside each downloaded file
  • Single binary -- no Python, no pip, no dependencies; works on HPC clusters

Quick Example

# Log in
egafetch auth login --cf credentials.json

# Download an entire dataset with parallel chunked downloads
egafetch download EGAD00001001938 -o ./data

# Interrupted? Just re-run -- it resumes automatically
egafetch download EGAD00001001938 -o ./data

# Metadata is auto-fetched during download (when using --cf)
# Or export independently:
egafetch metadata EGAD00001001938 --cf credentials.json

At a Glance

pyEGA3 EGAfetch
Parallel files 1 Configurable (default 4)
Parallel chunks 1 Configurable (default 8)
Resume Limited Full (chunk-level, byte-precise)
Token refresh Manual Automatic
Bandwidth throttling No --max-bandwidth
File filtering No --include / --exclude globs
Adaptive chunks No --adaptive-chunks
Persistent config No ~/.egafetch/config.yaml
Installation pip install Single binary
Batch file input No Text file with identifiers
MD5 sidecar files No .md5 per file
Metadata export No TSV / CSV / JSON (auto during download)

Ready to get started? Head to the Installation guide.