How to tokenize a BED file on the command line

For hard tokenization from Python, run:

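from geniml.tokenization import hard_tokenization

# Folder containing the raw BED files to tokenize
src_folder = '/path/to/raw/bed/files/'
# Folder where the tokenized files are written
dst_folder = '/path/to/tokenized_files/'
# BED file defining the universe (the vocabulary of regions)
universe_file = '/path/to/universe_file.bed'
# The last argument is the minimum overlap fraction (1e-9, i.e. 1 bp)
hard_tokenization(src_folder, dst_folder, universe_file, 1e-9)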

Tokenization uses the intersect function of bedtools. To switch to a different tool, override the bedtools_tokenization function and pass the path to your tool through the bedtools_path argument. The fraction argument sets the minimum overlap required, expressed as a fraction of a universe region (default: 1E-9, i.e. 1 bp; maximum: 1.0). A raw region is mapped to a universe region when their overlap is above this threshold.
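To make the fraction threshold concrete, here is a minimal pure-Python sketch of the overlap test for a single pair of regions. It is only an illustration, not the geniml or bedtools implementation: the helper names overlap_bp and maps_to are hypothetical, regions are assumed to be half-open (start, end) pairs on the same chromosome, and an overlap that exactly equals the threshold is assumed to count as a match.

```python
def overlap_bp(raw, universe):
    """Number of base pairs shared by two half-open (start, end) intervals."""
    return max(0, min(raw[1], universe[1]) - max(raw[0], universe[0]))


def maps_to(raw, universe, fraction=1e-9):
    """True if the overlap covers at least `fraction` of the universe region.

    Hypothetical helper for illustration; both regions are assumed to lie
    on the same chromosome.
    """
    shared = overlap_bp(raw, universe)
    universe_len = universe[1] - universe[0]
    # Require at least one shared base so that disjoint regions never match.
    return shared > 0 and shared / universe_len >= fraction


universe_region = (100, 200)   # 100 bp universe region
raw_region = (190, 250)        # shares 10 bp with the universe region

print(maps_to(raw_region, universe_region))                 # default threshold
print(maps_to(raw_region, universe_region, fraction=0.5))   # stricter threshold
```

With the default threshold, a single shared base pair is enough for a match, which is why 1E-9 behaves like "1 bp" for any realistic region length. Raising fraction toward 1.0 requires the raw region to cover more of the universe region.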

By default, the code assumes that a bedtools binary is available and can be called from the command line. If bedtools cannot be found, the code raises an exception. To fix this, set bedtools_path to the location of a bedtools binary.
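If you would rather check for bedtools before starting a long tokenization run, a short pre-flight check along the following lines may help. This is a sketch using only the Python standard library; resolve_bedtools is a hypothetical helper and not part of geniml, and the exception geniml itself raises may differ.

```python
import shutil


def resolve_bedtools(bedtools_path="bedtools"):
    """Return the resolved location of the bedtools binary.

    Hypothetical helper: accepts either a bare command name that is looked
    up on PATH, or an explicit path to an executable file.
    """
    found = shutil.which(bedtools_path)
    if found is None:
        raise FileNotFoundError(
            f"bedtools not found at '{bedtools_path}'; "
            "install it or point bedtools_path at a bedtools binary"
        )
    return found
```

The returned value can then be supplied as bedtools_path, so that a missing binary is reported before any files are processed.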

Command line usage

geniml tokenize --data-folder /folder/with/raw/BED/files --token-folder ./tokens --universe /universe/file --bedtools-path bedtools

For more details, type geniml tokenize --help.