BEDbase Data Loading

This document describes how BEDbase stays up to date with new BED files from public repositories.

Data Loading Steps

  1. GEO publishes new data — Researchers deposit BED files (ChIP-seq peaks, ATAC-seq regions, etc.) to NCBI GEO
  2. geopephub scans GEO — Every 2 days, a GitHub Actions workflow in pepkit/geopephub scans GEO for new projects containing BED-like files
  3. geopephub uploads to PEPhub — Matching projects are uploaded as PEPs to the bedbase namespace in PEPhub
  4. bedbase-loader queries PEPhub — Daily, a GitHub Actions workflow in databio/bedbase-loader queries PEPhub for recently updated projects
  5. Light processing runs — Files are downloaded, validated, and inserted into PostgreSQL/S3 (fast, no heavy computation)
  6. Heavy processing runs — AWS Fargate runs statistics, generates embeddings, and indexes to Qdrant (slow, compute-intensive)
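
The six steps above amount to a small state machine: each project advances from discovery through light processing to full processing. A minimal sketch (the state names here are illustrative, not the actual BEDbase schema):

```python
from enum import Enum

class Status(Enum):
    """Illustrative processing states; the real BEDbase schema may differ."""
    NEW = "new"               # discovered on GEO, PEP uploaded to PEPhub
    LIGHT_DONE = "light"      # validated, in PostgreSQL/S3, awaiting heavy pass
    FULLY_PROCESSED = "full"  # stats, embeddings, and Qdrant index complete

def advance(status: Status) -> Status:
    """Move a project one stage forward through the pipeline."""
    order = [Status.NEW, Status.LIGHT_DONE, Status.FULLY_PROCESSED]
    i = order.index(status)
    return order[min(i + 1, len(order) - 1)]

print(advance(Status.NEW).value)         # light
print(advance(Status.LIGHT_DONE).value)  # full
```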

Architecture Diagram

```mermaid
flowchart TB

subgraph Sources
    GEO[("1. GEO (NCBI)<br/>Source Data")]
end

subgraph PEPhubSystem["PEPhub System (pepkit)"]
    geopephub["2. geopephub<br/>(GitHub Actions)"]
    PEPhub[("3. PEPhub DB<br/>bedbase namespace")]
end

subgraph BEDbaseAutomation["BEDbase Automation (databio)"]
    Loader["4. bedbase-loader<br/>(GitHub Actions)"]
end

subgraph Infrastructure["BEDbase Infrastructure"]
    direction LR
    Postgres[("PostgreSQL")]
    Qdrant[("Qdrant")]
    S3[("S3")]
end

subgraph HeavyProcessing
    Fargate["6. AWS Fargate"]
end

GEO -->|"scan for BED files"| geopephub
geopephub -->|"upload PEPs"| PEPhub
PEPhub -->|"query by date"| Loader
Loader -->|"5. Light processing"| Infrastructure
Fargate -->|"Heavy processing"| Infrastructure

```

Data Sources

GEO (Gene Expression Omnibus)

GEO is the primary source of BED files. NCBI's GEO database contains genomic datasets including ChIP-seq peaks (narrowPeak, broadPeak), ATAC-seq accessibility regions, DNase-seq hypersensitivity sites, and other genomic interval data.

PEPhub and the bedbase Namespace

PEPhub (https://pephub.databio.org) serves as a metadata intermediary between GEO and BEDbase. The key component is the bedbase namespace — a curated subset of GEO containing only projects with BED-like files.

How the bedbase Namespace is Populated

The bedbase namespace is populated by geopephub, a tool maintained by the pepkit organization (separate from bedbase). This runs as a GitHub Actions workflow in the geopephub repository:

Workflow: bedbase_uploader.yml (runs every 2 days at 10:00 UTC)

```bash
geopephub run-queuer --target bedbase --period 2
geopephub run-uploader --target bedbase
```

What geopephub does:

  1. Queuer — Scans GEO for new projects containing BED-like files (narrowPeak, broadPeak, BED) using geofetch's Finder module
  2. Uploader — Downloads metadata via geofetch and uploads PEPs to PEPhub under the bedbase namespace
  3. Checker — Verifies upload success and retries failures

This means:

  - The bedbase namespace is not managed by the BEDbase project itself
  - It is managed by the pepkit organization via geopephub
  - BEDbase (via bedbase-loader) is a consumer of this namespace, not the producer
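
Projects in PEPhub are addressed by registry paths of the form namespace/name:tag, with the tag defaulting to default when omitted. A minimal parser for such paths (a hypothetical helper for illustration, not part of geopephub):

```python
def parse_registry_path(path: str) -> dict:
    """Split a PEPhub registry path 'namespace/name:tag' into parts.

    The ':tag' suffix is optional and falls back to 'default',
    mirroring PEPhub's convention.
    """
    namespace, _, rest = path.partition("/")
    name, _, tag = rest.partition(":")
    return {"namespace": namespace, "name": name, "tag": tag or "default"}

print(parse_registry_path("bedbase/GSE123456:default"))
# {'namespace': 'bedbase', 'name': 'GSE123456', 'tag': 'default'}
```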

PEPhub Namespaces

| Namespace | URL | Contents | Updated by |
| --- | --- | --- | --- |
| geo | https://pephub.databio.org/geo | All GEO projects (~99% of GEO) | geofetch (weekly) |
| bedbase | https://pephub.databio.org/bedbase | BED-file projects only | geopephub (every 2 days) |

bedbase-loader Repository

The databio/bedbase-loader repository contains the GitHub Actions workflows that pull data from the bedbase namespace and load it into BEDbase infrastructure.

| Workflow | Schedule | Purpose |
| --- | --- | --- |
| upload_cron.yml | Daily 18:00 UTC | Fetch new GEO samples from the last 3 days |
| upload_cron_series.yml | Daily 18:00 UTC | Fetch new GEO series from the last 3 days |
| update_genomes.yml | Weekly (Tuesdays) | Sync genome info from Refgenie |
| update_umap.yml | Manual | Regenerate UMAP visualizations |

Configuration: See config.yaml in the bedbase-loader repo.
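
The config ties together the backends listed under Infrastructure. A sketch of its general shape (all keys below are illustrative assumptions; consult the actual config.yaml in databio/bedbase-loader for the real structure):

```yaml
# Illustrative only -- not the actual bedbase-loader config.yaml.
database:
  host: postgres.example.org
  name: bedbase
qdrant:
  host: qdrant.example.org
  collection: bedbase
s3:
  bucket: bedbase-files
```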

Phase 1: Light Processing (GitHub Actions)

The daily workflows run bedboss geo upload-all with the --lite flag:

```bash
bedboss geo upload-all \
  --start-date <3 days ago> --end-date <today> \
  --geo-tag samples --lite \
  --bedbase-config config.yaml --outfolder /tmp/out
```

What light processing does:

  1. Query PEPhub for GSE projects updated in the time window
  2. Skip already-processed projects (via database flags)
  3. Download metadata and BED files
  4. Validate file format and genome
  5. Insert records to PostgreSQL and upload files to S3
  6. Flag project as ready for heavy processing
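
The `<3 days ago>` and `<today>` placeholders in the command above are computed at workflow run time. A sketch of that date-window computation (the ISO YYYY-MM-DD format is an assumption; check the bedboss CLI help for the format it actually expects):

```python
from datetime import date, timedelta

def processing_window(days_back: int = 3) -> tuple[str, str]:
    """Return (start, end) date strings for the daily light-processing run."""
    today = date.today()
    start = today - timedelta(days=days_back)
    return start.isoformat(), today.isoformat()

start, end = processing_window()
print(f"--start-date {start} --end-date {end}")
```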

Phase 2: Heavy Processing (AWS Fargate)

After light processing, heavy processing runs on AWS Fargate via a scheduled task using the bedboss Docker image:

```bash
bedboss reprocess-all --bedbase-config config.yaml --outfolder /tmp/out --limit 100
```

What heavy processing does:

  1. Quality control (bedqc) — validate file integrity, region counts, region widths
  2. Statistics (bedstat) — calculate GC content, TSS distances, genomic feature percentages
  3. Embeddings (region2vec) — generate vector embeddings for semantic search
  4. Qdrant indexing — upload embeddings to vector database
  5. Flag file as fully processed
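
Qdrant serves approximate nearest-neighbor queries over the uploaded embeddings; the retrieval idea reduces to cosine similarity, sketched here with toy vectors (real region2vec embeddings have far more dimensions, and Qdrant's index is approximate rather than exhaustive):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity -- a metric commonly used for embedding search."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" keyed by file name; purely illustrative.
index = {
    "GSE1_peaks.bed": [0.9, 0.1, 0.0],
    "GSE2_atac.bed": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.0]
best = max(index, key=lambda name: cosine(query, index[name]))
print(best)  # GSE1_peaks.bed
```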

Infrastructure

BEDbase uses: PostgreSQL (metadata, processing status), Qdrant (vector embeddings), and S3-compatible storage (files). See BEDbase Configuration for details.