How to tokenize a BED file on the command line
For hard tokenization, run:
from geniml.tokenization import hard_tokenization
src_folder = '/path/to/raw/bed/files/'
dst_folder = '/path/to/tokenized_files/'
universe_file = '/path/to/universe_file.bed'
# The last argument is the fraction (minimum overlap) threshold
hard_tokenization(src_folder, dst_folder, universe_file, 1e-9)
We use the intersect function of bedtools to do tokenization. If you want to switch to a different tool, you can override the bedtools_tokenization function in hard_tokenization_batch.py and provide the path to your tool through the bedtools_path input argument. The fraction argument specifies the minimum overlap required, as a fraction of a universe region (default: 1e-9, i.e. 1 bp; maximum: 1.0). A raw region is mapped to a universe region when their overlap is above the threshold.
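To make the threshold concrete, here is a minimal sketch of the overlap check in plain Python. It is not geniml's implementation (which delegates to bedtools). The helper names overlap_fraction and is_mapped are hypothetical, and the sketch assumes half-open BED coordinates, that the fraction is measured against the universe region's length, and that the boundary case (overlap exactly equal to the threshold) counts as a match.

```python
def overlap_fraction(universe_region, raw_region):
    """Return the share of the universe region covered by the raw region.

    Both regions are (start, end) tuples on the same chromosome,
    using half-open BED coordinates (assumption for this sketch).
    """
    u_start, u_end = universe_region
    r_start, r_end = raw_region
    # Overlap length in base pairs; zero when the regions do not intersect.
    overlap = max(0, min(u_end, r_end) - max(u_start, r_start))
    return overlap / (u_end - u_start)


def is_mapped(universe_region, raw_region, fraction=1e-9):
    """Hypothetical helper: True when the overlap reaches the threshold."""
    return overlap_fraction(universe_region, raw_region) >= fraction


# A 100 bp universe region, half covered by a raw region.
print(overlap_fraction((100, 200), (150, 300)))
# With the default threshold, a single overlapping base pair is enough.
print(is_mapped((100, 200), (199, 300)))
# Adjacent regions share no bases and are not mapped.
print(is_mapped((100, 200), (200, 300)))
```

With the default of 1e-9, any overlap of at least 1 bp passes for realistic region lengths, while a value of 1.0 would require the universe region to be fully covered.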
By default, the code assumes the bedtools binary exists and can be called from the command line. If bedtools does not exist, the code will raise an exception. To resolve this, set bedtools_path to point to a bedtools binary.
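If you would rather check for the binary before starting a long tokenization run, a small pre-flight check along these lines may help. This is a sketch using only the Python standard library; resolve_bedtools is a hypothetical helper and not part of geniml.

```python
import shutil


def resolve_bedtools(bedtools_path="bedtools"):
    """Return the full path of the bedtools binary, or raise if not found.

    Hypothetical helper: bedtools_path may be a bare command name
    (looked up on PATH) or an explicit path to an executable.
    """
    resolved = shutil.which(bedtools_path)
    if resolved is None:
        raise FileNotFoundError(
            f"Could not find an executable at or named {bedtools_path!r}; "
            "install bedtools or pass an explicit path."
        )
    return resolved
```

The returned path could then be supplied as the bedtools_path argument described above.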
Command line usage
geniml tokenize --data-folder /folder/with/raw/BED/files --token-folder ./tokens --universe /universe/file --bedtools-path bedtools
For more details, type geniml tokenize --help.