How to use BEDspace to jointly embed regions and metadata

Introduction

BEDspace is an application of the StarSpace model to genomic interval data, described in Gharavi et al. 2023. It allows us to train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, BEDspace solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string; suggesting new labels for database region sets; and retrieving database region sets similar to a query region set.
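To make the three retrieval tasks concrete, here is a minimal Python sketch of nearest-neighbor lookup in a shared embedding space. The vectors, file names, and labels are made up for illustration, and cosine distance is assumed as the distance measure; this is a toy model of the idea, not the BEDspace implementation.

```python
import math

# Hypothetical 3-dimensional co-embeddings (assumption: a real model
# learns these vectors; the names and values here are invented).
region_sets = {
    "a.bed": [0.9, 0.1, 0.0],
    "b.bed": [0.8, 0.2, 0.1],
    "c.bed": [0.0, 0.1, 0.9],
}
labels = {
    "K562": [1.0, 0.0, 0.0],
    "HepG2": [0.0, 0.0, 1.0],
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query_vec, candidates, n=1):
    """Return the names of the n candidates closest to query_vec."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_distance(query_vec, candidates[name]),
    )
    return ranked[:n]


# Task 1: query label -> related region sets
l2r = nearest(labels["K562"], region_sets, n=2)

# Task 2: region set -> suggested labels
r2l = nearest(region_sets["c.bed"], labels, n=1)

# Task 3: region set -> similar region sets (the query itself is excluded)
others = {name: vec for name, vec in region_sets.items() if name != "a.bed"}
r2r = nearest(region_sets["a.bed"], others, n=1)

print(l2r, r2l, r2r)
```

Because region sets and labels live in the same space, all three tasks reduce to the same ranking step; only the type of the query and the type of the candidates change.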

Installation

The bedspace module is installed with geniml. To ensure that everything is working correctly, run: python -c "from geniml import bedspace".

BEDspace operations

There are four main commands in bedspace:

bedspace preprocess: preprocesses a set of genomic interval regions and their associated metadata into a format that can be used by bedspace train.
bedspace train: trains a StarSpace model on the preprocessed data.
bedspace distances: computes distances between region sets in the trained model and metadata labels.
bedspace search: searches for the most similar region sets and metadata labels to a given query. Three scenarios for this command are described in the details.

These commands are accessed via the command line with geniml bedspace <command>.

`bedspace preprocess`

The preprocess command prepares a set of region sets and metadata labels for training. This includes adding the __label__ prefix to metadata labels and converting the region sets into a format that StarSpace can read. The command takes a set of region sets and their metadata labels as input and writes the preprocessed region sets and labels to the output path. The command can be run as follows:

geniml bedspace preprocess \
    --input <path to input region sets> \
    --metadata <path to input metadata labels> \
    --universe <path to universe file> \
    --labels <comma-separated target labels> \
    --output <path to output preprocessed region sets>

Input Description:

--input: Specifies the path to the folder containing the region sets.
--metadata: Specifies the path to the metadata file in CSV format. The CSV file should include a column for file_name and separate columns for each label. The file_name column contains the names of the region set files, and the label columns contain the corresponding labels for each region set.
--universe: Specifies the path to the universe file. The universe file contains the chromosome, start position, and end position for each region for region set tokenization.
--labels: Specifies the target labels as a single string containing labels separated by commas.
--output: Specifies the path where the preprocessed region sets will be saved.
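As an illustration of the metadata layout described above, the following Python sketch writes a small CSV with a file_name column and one column per label. The file names, the label columns cell_type and target, and their values are hypothetical. The sketch also assumes that --labels takes the names of the label columns, which the description above does not state explicitly. The last step only illustrates the __label__ prefix convention and is not the exact output of bedspace preprocess.

```python
import csv
import io

# Hypothetical metadata: one row per region set file, one column per label.
rows = [
    {"file_name": "sample1.bed", "cell_type": "K562", "target": "CTCF"},
    {"file_name": "sample2.bed", "cell_type": "HepG2", "target": "H3K27ac"},
]
label_columns = ["cell_type", "target"]

# Write the CSV into an in-memory buffer; swap in open("metadata.csv", "w",
# newline="") to write a real file instead.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["file_name"] + label_columns,
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
metadata_csv = buffer.getvalue()
print(metadata_csv)

# Assumption: --labels is given the label column names, joined by commas.
labels_arg = ",".join(label_columns)
print(labels_arg)

# Illustration of the __label__ prefix applied to each label value.
prefixed = {
    row["file_name"]: ["__label__" + row[col] for col in label_columns]
    for row in rows
}
print(prefixed)
```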

`bedspace train`

The train command will train a StarSpace model on the preprocessed region sets and metadata labels. It requires that you have run the preprocess command first. The train command takes in a set of preprocessed region sets and metadata labels, and outputs a trained StarSpace model. The command can be run as follows:

geniml bedspace train \
    --path-to-starspace <path to StarSpace executable> \
    --input <path to preprocessed region sets> \
    --output <path to output trained model> \
    --dim <dimension of embedding space> \
    --epochs <number of epochs to train for> \
    --lr <learning rate>

Input Description:

--path-to-starspace: Specifies the path to the StarSpace executable.
--input: Specifies the path to the preprocessed region sets file generated by the preprocess command. The file should be in TXT format.
--output: Specifies the path where the trained model will be saved.
--dim: Sets the dimension of the embedding vectors that the StarSpace model learns for region sets and labels.
--epochs: Specifies the number of epochs to train the StarSpace model.
--lr: Sets the learning rate for the training process.

`bedspace distances`

The distances command computes the distances between all of the region sets and metadata labels in the trained model. It requires that you have run the train command first. The distances command takes a trained StarSpace model as input and outputs the distances between all of the region sets and metadata labels in the model. The command can be run as follows:

geniml bedspace distances \
    --input <path to trained model> \
    --metadata <path to input metadata labels> \
    --universe <path to universe file> \
    --labels <comma-separated target labels> \
    --files <path to region sets> \
    --output <path to output distances>

Input Description:

--input: Specifies the path to the trained model generated by the bedspace train command.
--metadata: Specifies the path to the input metadata labels.
--universe: Specifies the path to the universe file used for test file tokenization.
--labels: Specifies the target labels as a single string containing labels separated by commas.
--files: Specifies the path to the new region sets.
--output: Specifies the path where the distances will be saved. These cover distances between labels and files as well as distances between database files and new files.

`bedspace search`

The search command requires that you have previously run the distances command. It also requires a query. To search, you must specify one of three scenarios:

r2l (region-to-label): You have a query region set and want to find the most similar metadata labels,
l2r (label-to-region): You have a query metadata label and want to find the most similar region sets, and
r2r (region-to-region): You have a query region set and want to find the most similar region sets.

Example usage for each type is given below:

`r2l`

geniml bedspace search \
    -t r2l \
    -d <path to distances> \
    -n <number of results to return> \
    path/to/regions.bed

`l2r`

geniml bedspace search \
    -t l2r \
    -d <path to distances> \
    -n <number of results to return> \
    K562

`r2r`

geniml bedspace search \
    -t r2r \
    -d <path to distances> \
    -n <number of results to return> \
    path/to/regions.bed

Input Description:

-t: Specifies the search type.
-d: Specifies the path to the distances file generated by the bedspace distances command.
-n: Specifies the number of top results to return.
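Since the three search types are easy to mistype, a small wrapper can check the type before the command is run. The Python sketch below assembles the argument list from the flags documented above (-t, -d, -n, and the positional query). The helper name build_search_command and the example path are hypothetical, and the sketch only builds and prints the command rather than executing it.

```python
import shlex

# Search types named in this document, with the kind of query each expects.
SEARCH_TYPES = {
    "r2l": "path to a query region set",
    "l2r": "query metadata label",
    "r2r": "path to a query region set",
}


def build_search_command(search_type, distances_path, n_results, query):
    """Return the argument list for a bedspace search call.

    Hypothetical helper: it validates the search type and arranges the
    documented flags, but does not run anything.
    """
    if search_type not in SEARCH_TYPES:
        valid = ", ".join(sorted(SEARCH_TYPES))
        raise ValueError(
            f"unknown search type {search_type!r}; expected one of: {valid}"
        )
    return [
        "geniml", "bedspace", "search",
        "-t", search_type,
        "-d", str(distances_path),
        "-n", str(n_results),
        str(query),
    ]


# Label-to-region search for the label K562, returning 5 results.
cmd = build_search_command("l2r", "path/to/distances", 5, "K562")
print(shlex.join(cmd))
```

To execute the assembled command, the list can be passed to subprocess.run(cmd, check=True) once geniml is installed and the distances file exists.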