# BEDbase Data Loading
This document describes how BEDbase stays up to date with new BED files from public repositories.
## Data Loading Steps

1. GEO publishes new data → Researchers deposit BED files (ChIP-seq peaks, ATAC-seq regions, etc.) to NCBI GEO
2. geopephub scans GEO → Every 2 days, the pepkit/geopephub GitHub Actions workflow scans GEO for new projects containing BED-like files
3. geopephub uploads to PEPhub → Matching projects are uploaded as PEPs to the `bedbase` namespace in PEPhub
4. bedbase-loader queries PEPhub → Daily, the databio/bedbase-loader GitHub Actions workflow queries PEPhub for recently updated projects
5. Light processing runs → Files are downloaded, validated, and inserted into PostgreSQL/S3 (fast, no heavy computation)
6. Heavy processing runs → AWS Fargate runs statistics, generates embeddings, and indexes to Qdrant (slow, compute-intensive)
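The hand-off between steps 5 and 6 is driven by per-file status flags stored in the database. A minimal sketch of that state model (the field and function names here are illustrative, not BEDbase's actual schema):

```python
from dataclasses import dataclass

@dataclass
class BedFileRecord:
    """Illustrative record; BEDbase's real schema differs."""
    geo_accession: str
    genome: str
    light_done: bool = False   # downloaded, validated, stored in Postgres/S3
    heavy_done: bool = False   # stats, embeddings, and Qdrant index complete

def ready_for_heavy(rec: BedFileRecord) -> bool:
    # Heavy processing only picks up files that finished the light phase.
    return rec.light_done and not rec.heavy_done
```

The point of the split is that the cheap light phase can run daily on GitHub Actions, while the expensive heavy phase drains the backlog of flagged files on Fargate at its own pace.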
## Architecture Diagram
```mermaid
flowchart TB
subgraph Sources
GEO[("1. GEO (NCBI)<br/>Source Data")]
end
subgraph PEPhubSystem["PEPhub System (pepkit)"]
geopephub["2. geopephub<br/>(GitHub Actions)"]
PEPhub[("3. PEPhub DB<br/>bedbase namespace")]
end
subgraph BEDbaseAutomation["BEDbase Automation (databio)"]
Loader["4. bedbase-loader<br/>(GitHub Actions)"]
end
subgraph Infrastructure["BEDbase Infrastructure"]
direction LR
Postgres[("PostgreSQL")]
Qdrant[("Qdrant")]
S3[("S3")]
end
subgraph HeavyProcessing
Fargate["6. AWS Fargate"]
end
GEO -->|"scan for BED files"| geopephub
geopephub -->|"upload PEPs"| PEPhub
PEPhub -->|"query by date"| Loader
Loader -->|"5. Light processing"| Infrastructure
Fargate -->|"Heavy processing"| Infrastructure
```
## Data Sources

### GEO (Gene Expression Omnibus)
GEO is the primary source of BED files. NCBI's GEO database contains genomic datasets including ChIP-seq peaks (narrowPeak, broadPeak), ATAC-seq accessibility regions, DNase-seq hypersensitivity sites, and other genomic interval data.
## PEPhub and the bedbase Namespace

PEPhub (https://pephub.databio.org) serves as a metadata intermediary between GEO and BEDbase. The key component is the `bedbase` namespace: a curated subset of GEO containing only projects with BED-like files.
### How the bedbase Namespace is Populated
The bedbase namespace is populated by geopephub, a tool maintained by the pepkit organization (separate from bedbase). This runs as a GitHub Actions workflow in the geopephub repository:
Workflow: `bedbase_uploader.yml` (runs every 2 days at 10:00 UTC)

```shell
geopephub run-queuer --target bedbase --period 2
geopephub run-uploader --target bedbase
```
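The queuer/uploader split amounts to a two-stage filter-then-push loop. A hedged illustration of that pattern (the `run_cycle` function and `upload` callback are invented here for clarity, not geopephub's actual API):

```python
BED_LIKE = {"narrowPeak", "broadPeak", "bed"}

def run_cycle(new_projects, upload):
    """Queue GEO projects that contain BED-like files, then upload each.

    new_projects: iterable of (accession, file_types) pairs found in GEO.
    upload: callback that pushes one project to a PEPhub namespace.
    Returns the accessions that were uploaded.
    """
    # Queuer stage: keep only projects with at least one BED-like file
    queue = [acc for acc, ftypes in new_projects if BED_LIKE & set(ftypes)]
    # Uploader stage: push each queued project to the bedbase namespace
    return [upload("bedbase", acc) or acc for acc in queue]
```

Separating the stages lets the real tool retry failed uploads (the Checker step) without rescanning GEO.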
What geopephub does:

- Queuer → Scans GEO for new projects containing BED-like files (narrowPeak, broadPeak, BED) using geofetch's Finder module
- Uploader → Downloads metadata via geofetch and uploads PEPs to PEPhub under the `bedbase` namespace
- Checker → Verifies upload success and retries failures
This means:
- The bedbase namespace is not managed by the bedbase project itself
- It's managed by the pepkit organization via geopephub
- BEDbase (via bedbase-loader) is a consumer of this namespace, not the producer
### PEPhub Namespaces

| Namespace | URL | Contents | Updated by |
|---|---|---|---|
| `geo` | https://pephub.databio.org/geo | All GEO projects (~99% of GEO) | geofetch (weekly) |
| `bedbase` | https://pephub.databio.org/bedbase | BED-file projects only | geopephub (every 2 days) |
## bedbase-loader Repository
The databio/bedbase-loader repository contains the GitHub Actions workflows that pull data from the bedbase namespace and load it into BEDbase infrastructure.
| Workflow | Schedule | Purpose |
|---|---|---|
| `upload_cron.yml` | Daily 18:00 UTC | Fetch new GEO samples from last 3 days |
| `upload_cron_series.yml` | Daily 18:00 UTC | Fetch new GEO series from last 3 days |
| `update_genomes.yml` | Weekly (Tuesdays) | Sync genome info from Refgenie |
| `update_umap.yml` | Manual | Regenerate UMAP visualizations |

Configuration: See `config.yaml` in the bedbase-loader repo.
## Phase 1: Light Processing (GitHub Actions)

The daily workflows run `bedboss geo upload-all` with the `--lite` flag:
```shell
bedboss geo upload-all \
  --start-date <3 days ago> --end-date <today> \
  --geo-tag samples --lite \
  --bedbase-config config.yaml --outfolder /tmp/out
```
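The `<3 days ago>` and `<today>` placeholders are resolved at run time. One way to compute the same trailing window (a sketch, not the loader's actual code):

```python
from datetime import date, timedelta

def date_window(days: int = 3):
    """Return (start, end) ISO-format dates for a trailing lookback window."""
    end = date.today()
    start = end - timedelta(days=days)
    return start.isoformat(), end.isoformat()
```

A 3-day window on a daily schedule deliberately overlaps previous runs, so a missed day does not permanently skip any projects; the already-processed check makes the overlap harmless.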
What light processing does:
- Query PEPhub for GSE projects updated in the time window
- Skip already-processed projects (via database flags)
- Download metadata and BED files
- Validate file format and genome
- Insert records to PostgreSQL and upload files to S3
- Flag project as ready for heavy processing
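The format-validation step above can be approximated with a per-line check like this minimal sketch (bedboss's real validation is more thorough and also verifies the genome):

```python
def looks_like_bed(line: str) -> bool:
    """Minimal BED check: tab-separated chrom, start, end with 0 <= start < end."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return False
    try:
        start, end = int(fields[1]), int(fields[2])
    except ValueError:
        return False
    return 0 <= start < end
```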
## Phase 2: Heavy Processing (AWS Fargate)
After light processing, heavy processing runs on AWS Fargate via a scheduled task using the bedboss Docker image:
```shell
bedboss reprocess-all --bedbase-config config.yaml --outfolder /tmp/out --limit 100
```
What heavy processing does:

- Quality control (bedqc) → validate file integrity, region counts, region widths
- Statistics (bedstat) → calculate GC content, TSS distances, genomic feature percentages
- Embeddings (region2vec) → generate vector embeddings for semantic search
- Qdrant indexing → upload embeddings to vector database
- Flag file as fully processed
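The region-count and region-width statistics that QC relies on reduce to simple arithmetic over intervals; a toy version (illustrative only, not bedstat's implementation):

```python
def region_stats(regions):
    """Summarize (chrom, start, end) intervals: count plus width statistics."""
    widths = [end - start for _, start, end in regions]
    return {
        "n_regions": len(widths),
        "mean_width": sum(widths) / len(widths) if widths else 0.0,
        "min_width": min(widths, default=0),
        "max_width": max(widths, default=0),
    }
```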
## Infrastructure
BEDbase uses: PostgreSQL (metadata, processing status), Qdrant (vector embeddings), and S3-compatible storage (files). See BEDbase Configuration for details.
## Related Documentation

- databio/bedbase-loader → Automation repo with workflows and config
- databio/bedboss → Processing pipeline CLI
- pepkit/geopephub → Populates the `bedbase` namespace in PEPhub
- BEDboss CLI Reference → Full command documentation
- How to Upload GEO Data → Manual upload guide
- BEDbase Configuration → Config file format