How to upload GEO data to BEDbase
BEDboss provides a geo submodule for uploading GEO data to BEDbase.
$ bedboss geo --help
Usage: bedboss geo [OPTIONS] COMMAND [ARGS]...
Automatic BEDbase uploader for GEO data
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --version -v App version │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ upload-all Run bedboss uploading pipeline for specified genome in specified period of time. │
│ upload-gse Run bedboss uploading pipeline for GSE. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
It has two main commands:
- 🟢 upload-all: Runs the BEDboss uploading pipeline for a specified genome within a given time period.
- 🟢 upload-gse: Uploads or reprocesses a specific GSE accession.
How does it work?
- First, PEPhub automatically uploads the metadata of GEO projects every day, using a GitHub Actions cron job. Each PEP corresponds to a GEO project (GSE) with all samples containing narrowPeak, broadPeak, or BED files. These files are not quality-checked and may include incorrectly formatted files that users labeled as BED-like.
- Next, the GEO uploader retrieves all PEPs from PEPhub for a specific time period (e.g., the last month) or a specific GSE.
  - upload-all: fetches all PEPs for the given time period.
  - upload-gse: fetches a specific GSE.
- Then, for each sample in the PEP, the bedboss-all function processes the files.
Info
BEDboss upload-all supports two upload modes:
- Full mode: downloads, uploads, and processes each file if it can.
- Lite mode: downloads and uploads each file if it can, but does not process it. This mode is useful when you want to upload a large number of files quickly and process them later.
For reprocessing files, there are separate commands: bedboss reprocess-all and bedboss reprocess-one.
- When a GSM (sample) is processed, it is flagged in the database as processed, so it will not be processed again in the future. In addition to the metadata from GEO, we also store the processing time and file digest in the database. The same applies to GSEs (projects): if a GSE or PEP in PEPhub has been processed and the reprocess flag is not set, the project will not be processed again.
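The skip-if-processed behaviour described above can be illustrated with a small shell sketch. It is not BEDboss code: a temporary text file stands in for the database flag, and the function name process_sample, the PROCESSED_LOG variable, and the sample ID are all made up for illustration.

```shell
# Illustrative only: a flat file stands in for the "processed" flag that
# BEDbase keeps in its database. None of these names come from BEDboss.
PROCESSED_LOG="$(mktemp)"

process_sample() {
  gsm="$1"
  rerun="${2:-no}"   # pass "yes" to force reprocessing

  # Skip samples that are already recorded, unless a rerun was requested.
  if [ "$rerun" != "yes" ] && grep -qx "$gsm" "$PROCESSED_LOG"; then
    echo "skip $gsm"
    return 0
  fi

  echo "process $gsm"
  # Record the sample once so that later runs skip it.
  grep -qx "$gsm" "$PROCESSED_LOG" || echo "$gsm" >> "$PROCESSED_LOG"
}

process_sample GSM0000001        # prints: process GSM0000001
process_sample GSM0000001        # prints: skip GSM0000001
process_sample GSM0000001 yes    # prints: process GSM0000001
```

The same idea applies one level up: a project that is already marked as processed is skipped unless reprocessing is explicitly requested.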
Full CLI docs:
$ bedboss geo upload-all --help
Usage: bedboss geo upload-all [OPTIONS]
Run bedboss uploading pipeline for specified genome in specified period of time.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --bedbase-config TEXT Path to bedbase config file [default: None] [required] │
│ * --outfolder TEXT Path to output folder [default: None] [required] │
│ --start-date TEXT The earliest date when opep was updated [Default: │
│ 2000/01/01] │
│ [default: None] │
│ --end-date TEXT The latest date when opep was updated [Default: today's │
│ date] │
│ [default: None] │
│ --search-limit INTEGER Limit of projects to be searched. [Default: 50] │
│ [default: 50] │
│ --search-offset INTEGER Limit of projects to be searched. [Default: 0] │
│ [default: 0] │
│ --download-limit INTEGER Limit of projects to be downloaded [Default: 100] │
│ [default: 100] │
│ --genome TEXT Reference genome [Default: None] (e.g. hg38) - if None, │
│ all genomes will be processed │
│ [default: None] │
│ --preload --no-preload Download bedfile before caching it. [Default: True] │
│ [default: preload] │
│ --create-bedset --no-create-bedset Create bedset from bed files. [Default: True] │
│ [default: create-bedset] │
│ --overwrite --no-overwrite Overwrite existing bedfiles. [Default: False] │
│ [default: no-overwrite] │
│ --overwrite-bedset --no-overwrite-bedset Overwrite existing bedset. [Default: False] │
│ [default: overwrite-bedset] │
│ --rerun --no-rerun Re-run all the samples. [Default: False] │
│ [default: no-rerun] │
│ --run-skipped --no-run-skipped Run skipped projects. [Default: False] │
│ [default: run-skipped] │
│ --run-failed --no-run-failed Run failed projects. [Default: False] │
│ [default: run-failed] │
│ --standardize-pep --no-standardize-pep Standardize pep with BEDMESS. [Default: False] │
│ [default: no-standardize-pep] │
│ --use-skipper --no-use-skipper Use skipper to skip projects if they were processed │
│ locally [Default: False] │
│ [default: no-use-skipper] │
│ --reinit-skipper --no-reinit-skipper Reinitialize skipper. [Default: False] │
│ [default: no-reinit-skipper] │
│ --lite --no-lite Run the pipeline in lite mode. [Default: False] │
│ [default: no-lite] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
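As a sketch of how these options combine, the wrapper below builds an upload-all call for one genome and one date window. The config path, output folder, dates, the upload_window function name, and the DRY_RUN switch are assumptions for illustration; only the flags come from the help text above. By default it prints the command rather than running it.

```shell
# Hypothetical locations; point these at your own config file and output folder.
BEDBASE_CONFIG="./bedbase_config.yaml"
OUTFOLDER="./geo_output"

# Dry run by default: the command is printed, not executed.
# Set DRY_RUN=0 to actually run bedboss.
DRY_RUN="${DRY_RUN:-1}"

upload_window() {
  genome="$1"       # e.g. hg38
  start_date="$2"   # YYYY/MM/DD, the form shown in the help text
  end_date="$3"     # YYYY/MM/DD
  shift 3           # any remaining arguments (e.g. --lite) are passed through

  if [ "$DRY_RUN" = "1" ]; then runner=echo; else runner=; fi

  $runner bedboss geo upload-all \
    --bedbase-config "$BEDBASE_CONFIG" \
    --outfolder "$OUTFOLDER" \
    --start-date "$start_date" \
    --end-date "$end_date" \
    --genome "$genome" \
    "$@"
}

# Example dates only. With GNU date, a rolling start date can be computed
# with: date -d '30 days ago' +%Y/%m/%d
upload_window hg38 2024/01/01 2024/01/31
```

Extra flags are passed through unchanged, so `upload_window hg38 2024/01/01 2024/01/31 --lite` would request lite mode for the same window.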
$ bedboss geo upload-gse --help
Usage: bedboss geo upload-gse [OPTIONS]
Run bedboss uploading pipeline for GSE.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --bedbase-config TEXT Path to bedbase config file [default: None] [required] │
│ * --outfolder TEXT Path to output folder [default: None] [required] │
│ * --gse TEXT GSE number that can be found in pephub. eg. GSE123456 │
│ [default: None] │
│ [required] │
│ --create-bedset --no-create-bedset Create bedset from bed files. [Default: True] │
│ [default: create-bedset] │
│ --genome TEXT reference genome to upload to database. If None, all │
│ genomes will be processed │
│ [default: None] │
│ --preload --no-preload Download bedfile before caching it. [Default: True] │
│ [default: preload] │
│ --rerun --no-rerun Re-run all the samples. [Default: False] [default: rerun] │
│ --run-skipped --no-run-skipped Run skipped projects. [Default: False] │
│ [default: run-skipped] │
│ --run-failed --no-run-failed Run failed projects. [Default: False] │
│ [default: run-failed] │
│ --overwrite --no-overwrite Overwrite existing bedfiles. [Default: False] │
│ [default: no-overwrite] │
│ --overwrite-bedset --no-overwrite-bedset Overwrite existing bedset. [Default: False] │
│ [default: overwrite-bedset] │
│ --standardize-pep --no-standardize-pep Standardize pep with BEDMESS. [Default: False] │
│ [default: no-standardize-pep] │
│ --use-skipper --no-use-skipper Use local skipper to skip projects if they were processed │
│ locally [Default: False] │
│ [default: no-use-skipper] │
│ --reinit-skipper --no-reinit-skipper Reinitialize skipper. [Default: False] │
│ [default: no-reinit-skipper] │
│ --lite --no-lite Run the pipeline in lite mode. [Default: False] │
│ [default: no-lite] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
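A single project can be uploaded in the same way. The sketch below checks that the argument looks like a GSE accession before building the upload-gse call. The paths, the upload_gse function name, and the DRY_RUN switch are assumptions for illustration, and GSE123456 is the placeholder accession from the help text, not a real project to upload. By default it prints the command rather than running it.

```shell
# Hypothetical locations; point these at your own config file and output folder.
BEDBASE_CONFIG="./bedbase_config.yaml"
OUTFOLDER="./geo_output"

# Dry run by default: the command is printed, not executed.
# Set DRY_RUN=0 to actually run bedboss.
DRY_RUN="${DRY_RUN:-1}"

upload_gse() {
  gse="$1"
  shift   # any remaining arguments (e.g. --rerun) are passed through

  # Reject anything that is not "GSE" followed by digits.
  if ! printf '%s\n' "$gse" | grep -Eq '^GSE[0-9]+$'; then
    echo "not a GSE accession: $gse" >&2
    return 1
  fi

  if [ "$DRY_RUN" = "1" ]; then runner=echo; else runner=; fi

  $runner bedboss geo upload-gse \
    --bedbase-config "$BEDBASE_CONFIG" \
    --outfolder "$OUTFOLDER" \
    --gse "$gse" \
    "$@"
}

# First upload of a project.
upload_gse GSE123456

# Same project again, asking bedboss to re-run all of its samples.
upload_gse GSE123456 --rerun
```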