Architecture
Overview
    CLI (cobra)
        |
    Orchestrator       Manages file-level parallelism
        |
    FileDownload       Per-file state machine
        |
    ChunkDownloader    HTTP Range requests + retries
        |
    Auth Manager       Automatic token refresh
Package Structure
    egafetch/
      cmd/egafetch/main.go      CLI entry point and command wiring
      internal/
        auth/
          auth.go               OAuth2 token management + refresh
          credentials.go        Credential storage (~/.egafetch/)
        api/
          client.go             EGA API client (metadata + download)
          types.go              API response types
        config/
          config.go             Persistent config file (~/.egafetch/config.yaml)
        download/
          orchestrator.go       Parallel file coordination
          file.go               Single file state machine + adaptive sizing
          chunk.go              Chunk downloader with retries + throttling
          merge.go              Chunk merging into final file
        state/
          manifest.go           Download manifest management
          state.go              Per-file state persistence
        verify/
          checksum.go           MD5/SHA256 verification
        ui/
          progress.go           Terminal progress bars
          status.go             Status display formatting
Orchestrator
The orchestrator manages file-level parallelism using a semaphore pattern:
- Receives a manifest (list of files to download)
- Launches one goroutine per file
- Each goroutine checks if the file is already complete before acquiring a semaphore slot
- Up to `--parallel-files` (default 4) files download simultaneously
- Uses `errgroup` for cancellation propagation -- if one file fails fatally, all are cancelled
File State Machine
Each file progresses through a deterministic state machine:
    pending --> chunking --> downloading --> merging --> verifying --> complete
                                  |             |
                                  v             v
                               failed <---------+
| State | Description |
|---|---|
| `pending` | Initial state, chunks not yet created |
| `chunking` | Splitting file into chunk ranges |
| `downloading` | Actively downloading chunks in parallel |
| `merging` | Concatenating chunk files into the final output |
| `verifying` | Validating checksum (MD5/SHA256) and writing `.md5` sidecar |
| `complete` | Download successful, `.md5` written, chunks cleaned up |
| `failed` | Failed after retries; may be retried at file level |
State is persisted to disk after every transition. This means you can interrupt at any point and resume cleanly.
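One way to picture the state machine is as a small transition table. The sketch below uses the state names from this section; the `State` type, the `next` map, and `canTransition` are hypothetical. It also assumes that every state before `complete` may move to `failed`, which is a simplification chosen for the example.

```go
package main

import "fmt"

// State holds one of the state names listed in this section.
type State string

const (
	Pending     State = "pending"
	Chunking    State = "chunking"
	Downloading State = "downloading"
	Merging     State = "merging"
	Verifying   State = "verifying"
	Complete    State = "complete"
	Failed      State = "failed"
)

// next encodes the success path: each state has exactly one successor.
var next = map[State]State{
	Pending:     Chunking,
	Chunking:    Downloading,
	Downloading: Merging,
	Merging:     Verifying,
	Verifying:   Complete,
}

// canTransition reports whether moving from one state to another is allowed.
// Assumption for this sketch: any state before complete may fail.
func canTransition(from, to State) bool {
	if to == Failed {
		return from != Complete && from != Failed
	}
	n, ok := next[from]
	return ok && n == to
}

func main() {
	// Walk the success path from pending to complete.
	for s := Pending; s != Complete; s = next[s] {
		fmt.Printf("%s -> %s\n", s, next[s])
	}
	fmt.Println("pending -> merging allowed?", canTransition(Pending, Merging))
}
```

Because each state has a single successor, a resumed download only needs the persisted state name to know which step to run next.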
Chunk Downloader
Files are split into chunks (default 64 MB) and downloaded in parallel:
- Each chunk is assigned a byte range (`start` to `end`)
- An HTTP Range request fetches exactly those bytes
- The response is streamed to a `.part` file
- If a `.part` file already has bytes on disk, the Range header starts from the existing size (resume)
- If the server returns HTTP 200 instead of 206 (ignoring the Range header), the existing file is truncated to prevent data corruption
- On completion, the chunk state is marked `complete`
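The range arithmetic behind these steps can be sketched as follows. `ByteRange`, `splitRanges`, `rangeHeader`, and `writeOffset` are hypothetical names, and the sketch assumes inclusive `end` offsets, which is how the HTTP `Range: bytes=start-end` header is defined.

```go
package main

import (
	"fmt"
	"net/http"
)

// ByteRange is a hypothetical chunk descriptor with inclusive bounds.
type ByteRange struct{ Start, End int64 }

// splitRanges divides a file of `size` bytes into chunks of at most chunkSize bytes.
func splitRanges(size, chunkSize int64) []ByteRange {
	if chunkSize <= 0 {
		return nil
	}
	var out []ByteRange
	for start := int64(0); start < size; start += chunkSize {
		end := start + chunkSize - 1
		if end > size-1 {
			end = size - 1 // the last chunk may be shorter
		}
		out = append(out, ByteRange{start, end})
	}
	return out
}

// rangeHeader builds the Range header value for a chunk, skipping the
// onDisk bytes that a previous attempt already wrote to the .part file.
func rangeHeader(r ByteRange, onDisk int64) string {
	return fmt.Sprintf("bytes=%d-%d", r.Start+onDisk, r.End)
}

// writeOffset returns where in the .part file the response body should be
// written: after the existing bytes on 206, or at zero (truncate) when the
// server ignored the Range header and answered 200.
func writeOffset(status int, onDisk int64) int64 {
	if status == http.StatusPartialContent {
		return onDisk
	}
	return 0
}

func main() {
	const mb = int64(1) << 20
	for i, r := range splitRanges(150*mb, 64*mb) {
		fmt.Printf("chunk %03d: %s\n", i, rangeHeader(r, 0))
	}
	fmt.Println("resume offset on 206:", writeOffset(http.StatusPartialContent, 4096))
	fmt.Println("resume offset on 200:", writeOffset(http.StatusOK, 4096))
}
```

Truncating on a 200 response matters because the body then contains the whole resource from byte zero; appending it to a partially written chunk would silently corrupt the file.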
Retry logic: Up to 5 retries per chunk with exponential backoff (1s base, 60s max) plus random jitter (0-1000ms).
Input Handling
CLI arguments are processed through `expandArgs`, which supports three input types:
- Dataset IDs (`EGAD...`) -- expanded via the EGA API into all files in the dataset
- File IDs (`EGAF...`) -- fetched individually via the EGA metadata API
- Identifier files -- any argument not starting with `EGAD`/`EGAF` is read as a text file with one identifier per line (blank lines and `#` comments are ignored)
Disk Layout
    ./output-dir/
      .egafetch/
        manifest.json                  File list and dataset info
        state/
          EGAF00001104661.json         Per-file state (status, chunks, progress)
        chunks/
          EGAF00001104661/
            000.part                   Temporary chunk files
            001.part
            002.part
      EGAF00001104661/
        SLX-9630.A006.bwa.bam          Completed file (after merge + verify)
        SLX-9630.A006.bwa.bam.md5      MD5 checksum sidecar (standard md5sum format)
After each file is downloaded, merged, and verified, an MD5 checksum file is written alongside the output file. This `.md5` file uses standard `md5sum` format and can be verified with `md5sum -c`.
All JSON state files are written atomically (temp file + fsync + rename) to prevent corruption on crashes.
Authentication Flow
EGAfetch uses two separate OAuth2 Identity Providers:
Download API (ega.ebi.ac.uk:8443):
- `grant_type=password` with EGA OIDC client credentials
- Tokens last ~1 hour
- Auto-refreshed 5 minutes before expiry using the refresh token
- Stored in `~/.egafetch/credentials.json`
Metadata API (idp.ega-archive.org):
- Separate IdP with `client_id=metadata-api`
- Tokens last 300 seconds
- Not persisted (fetched on-demand for metadata commands)
Dependencies
| Package | Purpose |
|---|---|
| `github.com/spf13/cobra` | CLI framework |
| `golang.org/x/sync` | `errgroup` for goroutine coordination |
| `golang.org/x/term` | Hidden password input |
| `golang.org/x/time` | `rate.Limiter` for bandwidth throttling |
| `gopkg.in/yaml.v3` | YAML config file parsing |