Architecture¶

Overview¶

CLI (cobra)
    |
Orchestrator              Manages file-level parallelism
    |
FileDownload              Per-file state machine
    |
ChunkDownloader           HTTP Range requests + retries
    |
Auth Manager              Automatic token refresh

Package Structure¶

egafetch/
  cmd/egafetch/main.go       CLI entry point and command wiring
  internal/
    auth/
      auth.go                 OAuth2 token management + refresh
      credentials.go          Credential storage (~/.egafetch/)
    api/
      client.go               EGA API client (metadata + download)
      types.go                API response types
    config/
      config.go               Persistent config file (~/.egafetch/config.yaml)
    download/
      orchestrator.go         Parallel file coordination
      file.go                 Single file state machine + adaptive sizing
      chunk.go                Chunk downloader with retries + throttling
      merge.go                Chunk merging into final file
    state/
      manifest.go             Download manifest management
      state.go                Per-file state persistence
    verify/
      checksum.go             MD5/SHA256 verification
    ui/
      progress.go             Terminal progress bars
      status.go               Status display formatting

Orchestrator¶

The orchestrator manages file-level parallelism using a semaphore pattern:

Receives a manifest (list of files to download)
Launches one goroutine per file
Each goroutine checks if the file is already complete before acquiring a semaphore slot
Up to --parallel-files (default 4) files download simultaneously
Uses errgroup for cancellation propagation -- if one file fails fatally, all are cancelled

File State Machine¶

Each file progresses through a deterministic state machine:

pending --> chunking --> downloading --> merging --> verifying --> complete
                            |              |
                            v              v
                          failed <---------+

State	Description
`pending`	Initial state, chunks not yet created
`chunking`	Splitting file into chunk ranges
`downloading`	Actively downloading chunks in parallel
`merging`	Concatenating chunk files into the final output
`verifying`	Validating checksum (MD5/SHA256) and writing `.md5` sidecar
`complete`	Download successful, `.md5` written, chunks cleaned up
`failed`	Failed after retries; may be retried at file level

State is persisted to disk after every transition. This means you can interrupt at any point and resume cleanly.

Chunk Downloader¶

Files are split into chunks (default 64 MB) and downloaded in parallel:

Each chunk is assigned a byte range (start to end)
An HTTP Range request fetches exactly those bytes
The response is streamed to a .part file
If a .part file already has bytes on disk, the Range header starts from the existing size (resume)
If the server returns HTTP 200 instead of 206 (ignoring the Range header), the existing file is truncated to prevent data corruption
On completion, the chunk state is marked complete

Retry logic: Up to 5 retries per chunk with exponential backoff (1s base, 60s max) plus random jitter (0-1000ms).

Input Handling¶

CLI arguments are processed through expandArgs, which supports three input types:

Dataset IDs (EGAD...) -- expanded via the EGA API into all files in the dataset
File IDs (EGAF...) -- fetched individually via the EGA metadata API
Identifier files -- any argument not starting with EGAD/EGAF is read as a text file with one identifier per line (blank lines and # comments are ignored)

Disk Layout¶

./output-dir/
    .egafetch/
        manifest.json              File list and dataset info
        state/
            EGAF00001104661.json   Per-file state (status, chunks, progress)
        chunks/
            EGAF00001104661/
                000.part           Temporary chunk files
                001.part
                002.part
    EGAF00001104661/
        SLX-9630.A006.bwa.bam     Completed file (after merge + verify)
        SLX-9630.A006.bwa.bam.md5 MD5 checksum sidecar (standard md5sum format)

After each file is downloaded, merged, and verified, an MD5 checksum file is written alongside the output file. This .md5 file uses standard md5sum format and can be verified with md5sum -c.

All JSON state files are written atomically (temp file + fsync + rename) to prevent corruption on crashes.

Authentication Flow¶

EGAfetch uses two separate OAuth2 Identity Providers:

Download API (ega.ebi.ac.uk:8443):

grant_type=password with EGA OIDC client credentials
Tokens last ~1 hour
Auto-refreshed 5 minutes before expiry using the refresh token
Stored in ~/.egafetch/credentials.json

Metadata API (idp.ega-archive.org):

Separate IdP with client_id=metadata-api
Tokens last 300 seconds
Not persisted (fetched on-demand for metadata commands)

Dependencies¶

Package	Purpose
`github.com/spf13/cobra`	CLI framework
`golang.org/x/sync`	`errgroup` for goroutine coordination
`golang.org/x/term`	Hidden password input
`golang.org/x/time`	`rate.Limiter` for bandwidth throttling
`gopkg.in/yaml.v3`	YAML config file parsing