How to train Region2Vec interval embeddings
Region2Vec is an unsupervised method for creating embeddings of genomic regions and region sets from a set of raw BED files. The program first maps all raw regions to a given universe (vocabulary) set. It then constructs sentences by concatenating the regions from a BED file in random order. The generated sentences are used to train Region2Vec with word2vec.
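The two preprocessing steps described above can be illustrated with a small, self-contained sketch. It is not the geniml implementation: the helper names (`overlaps`, `tokenize`, `make_sentences`), the toy intervals, and the rule that any overlap with a universe region counts as a match are all assumptions made for illustration. The word2vec training step is omitted; the resulting sentences are what a word2vec trainer would consume.

```python
import random


def overlaps(a, b):
    """Return True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def tokenize(regions, universe):
    """Map each raw region to every universe region it overlaps.

    Assumption: any overlap counts as a match. The real tokenizer may use
    a different criterion. This naive scan is O(len(regions) * len(universe)).
    """
    tokens = []
    for region in regions:
        for u in universe:
            if overlaps(region, u):
                tokens.append(f"{u[0]}:{u[1]}-{u[2]}")
    return tokens


def make_sentences(tokens, num_shufflings, seed=0):
    """Build one sentence per shuffling: the same tokens in a random order."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(num_shufflings):
        sentence = tokens[:]  # copy so the original token order is untouched
        rng.shuffle(sentence)
        sentences.append(sentence)
    return sentences


# Toy universe (vocabulary) and the regions of one hypothetical BED file.
universe = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 100)]
bed_regions = [("chr1", 50, 120), ("chr2", 10, 20), ("chr3", 5, 9)]

tokens = tokenize(bed_regions, universe)
sentences = make_sentences(tokens, num_shufflings=3)
for sentence in sentences:
    print(" ".join(sentence))
```

In this toy example the region on chr3 has no counterpart in the universe, so it contributes no token. Every sentence contains the same tokens, and only their order changes between shufflings.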
Usage
- Prepare a set of BED files in src_folder. [Optional] If only a subset of the files will be used, specify a list of those files as file_list. By default, the program uses all the files in the folder to train a Region2Vec model.
- Prepare a universe file, universe_file.
- Create a token folder, dst_folder, which will be used to store the tokenized files.
- Run the following command:

    from geniml.tokenization import hard_tokenization
    from geniml.region2vec import region2vec

    src_folder = '/path/to/raw/bed/files'
    dst_folder = '/path/to/tokenized_files'
    universe_file = '/path/to/universe_file'

    # must run tokenization first
    status = hard_tokenization(src_folder, dst_folder, universe_file, 1e-9)

    if status:  # if hard_tokenization is successful, then run Region2Vec training
        save_dir = '/path/to/training/results'
        region2vec(dst_folder, save_dir, num_shufflings=1000)

For customized settings, check the parameters used in main.py. When training a Region2Vec model, the parameters init_lr, window_size, num_shufflings, and embedding_dim are frequently tuned in experiments.
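When tuning these four parameters, it can help to enumerate the combinations up front. The sketch below only builds the list of settings and does not call geniml. The parameter names come from the text above, but the candidate values are hypothetical placeholders, and whether `region2vec` accepts every one of these names as a keyword argument is an assumption to check against main.py.

```python
from itertools import product

# Hypothetical candidate values; the source does not recommend specific settings.
search_space = {
    "init_lr": [0.025, 0.1],
    "window_size": [5, 50],
    "num_shufflings": [10, 1000],
    "embedding_dim": [100],
}


def expand_grid(space):
    """Return one dict per combination of the candidate values."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in product(*(space[k] for k in keys))
    ]


configs = expand_grid(search_space)
for config in configs:
    # Each config could be passed to a training run, one save_dir per config.
    print(config)
```

Writing each run to its own output directory keeps the trained models from overwriting one another.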
For command line usage, type geniml region2vec --help for details. A simple usage example is given below:

    geniml region2vec \
      --token-folder /path/to/token/folder \
      --save-dir ./region2vec_model \
      --num-shuffle 10 \
      --embed-dim 100 \
      --context-len 50