
How to use the tokenizers

Overview

The geniml tokenizers are used to prepare data for training, evaluation, and inference of genomic machine learning models. Like tokenizers for natural language processing, the geniml tokenizers convert raw data into a format that can be consumed by our models. geniml provides several tokenizers, but they all follow the same principles.

All tokenizers require a universe file (or vocabulary file). This is a BED file that contains all possible regions that can be tokenized. It may also include special tokens such as the start, end, unknown, and padding tokens.
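For illustration, a universe file is simply a tab-delimited, BED-formatted list of regions, one per line (the coordinates below are hypothetical; use a universe that matches your model):

chr1	10000	10600
chr1	10600	11200
chr2	20500	21000
chr2	21000	21750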

In addition to the tokenizers implemented here, we maintain a standalone package called gtokenizers, which provides tokenizer implementations in Rust with Python bindings. The Rust implementations are much faster than the Python ones, so we recommend using them whenever possible.

Using the tokenizers

To start using a tokenizer, simply pass it an appropriate universe file:

from geniml.tokenization import ITTokenizer # or any other tokenizer
from geniml.io import RegionSet

rs = RegionSet("/path/to/file.bed")
t = ITTokenizer("/path/to/universe.bed")

# tokenize the region set and print each token's coordinates
tokens = t.tokenize(rs)
for token in tokens:
    print(f"{token.chr}:{token.start}-{token.end}")

You can also get token ids for the tokens and pass them directly to a model:

import torch

from geniml.tokenization import ITTokenizer # or any other tokenizer
from geniml.region2vec import Region2Vec
from geniml.io import RegionSet

rs = RegionSet("/path/to/file.bed")
t = ITTokenizer("/path/to/universe.bed")

model = Region2Vec(len(t), 100) # 100 dimensional embedding

# tokenize the region set, then embed the token ids
tokens = t.tokenize(rs)
out = model(torch.tensor(tokens.ids))
print(out.shape) # one embedding per token
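Continuing from the snippet above, if you want a single vector for the whole region set (for example, for clustering or search), one simple approach is to average the token embeddings. This pooling step is only an illustration, not a geniml API:

with torch.no_grad():
    # average the per-token embeddings into one region-set vector
    region_set_embedding = out.mean(dim=0)

print(region_set_embedding.shape) # (100,)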

Future work

Genomic region tokenization is an active area of research. We will implement new tokenizers as they are developed. If you have a tokenizer you'd like to see implemented, please open an issue or submit a pull request.

For core development of our tokenizers, see the gtokenizers repository.