How to train a new Region2Vec model
Region2Vec is an unsupervised method for creating embeddings of genomic regions and genomic region sets. This tutorial discusses how to train a new model. To learn how to use a pre-trained model see the region2vec usage documentation.
Training data and universe
Training a model requires two things: 1) a set of pre-tokenized data and 2) a universe. The universe is a set of regions that the model will be trained on. The universe is used to create the tokenizer, which is used to convert the raw data into tokens. The universe should be representative of the data that you will be training the model on. For example, if you are training a model on human data, you should use a universe that contains human regions. If you dont have a universe, a safe bet is to use the 1000 tiles hg38 genome.
You can download the 1000 tiles hg38 genome here.
The pre-tokenized data is a set of .gtok
files. These are binary files that contain the tokenized data in the form of integers. The .gtok
files are used directly to train the model. If you have not pre-tokenized your data, see the pre-tokenization documentation.
Training a model
Instantiate a new model
To begin, create a new model from Region2VecExModel
.
Note: We use the
Region2VecExModel
because it is an extension of theRegion2Vec
class that comes with its own tokenizer. This ensures that the model and tokenizer are compatible.
import logging
import os
from multiprocessing import cpu_count
from geniml.io import RegionSet
from geniml.tokenization import TreeTokenizer
from geniml.region2vec import Region2VecExModel
from rich.progress import track
logging.basicConfig(level=logging.INFO)
# get the paths to data
universe_path = os.path.expandvars("$RESOURCES/regions/genome_tiles/tiles1000.hg38.bed")
data_path = os.path.expandvars("$DATA/ChIP-Atlas/hg38/ATAC_seq/tokens")
model = Region2VecExModel(
tokenizer=TreeTokenizer(universe_path),
)
Training data
The training data is a set of .gtok
files. You can use the Region2VecDataset
class to load the .gtok
files and train the model.
from geniml.region2vec.utils import Region2VecDataset
dataset = Region2VecDataset(data_path)
Training
Now, simply give the model the list of RegionSet
s and run the training:
# train the model
model.train(dataset, epochs=100)
You can export your model using the export
function:
model.export("out")
These files are intended to be directly uploaded to huggingface. You can upload them using the huggingface-cli
or the huggingface website.