# Pre-tokenize data for training
Before doing any training with Atacformer, you need to pre-tokenize your genomic interval data. This step converts your data into a format the Atacformer model can understand: input IDs that represent genomic regions.
For easy reading, writing, and manipulation of pre-tokenized data, we use the Parquet format. This format is efficient and works well with the Hugging Face ecosystem; namely, the `datasets` library.
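Once written, the pre-tokenized file can be loaded directly with `datasets`. A minimal sketch (the file path is illustrative):

```python
from datasets import load_dataset

# load the pre-tokenized Parquet file as a Hugging Face dataset
dataset = load_dataset("parquet", data_files="path/to/tokenized_data.parquet")
print(dataset["train"][0]["input_ids"][:10])
```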
## Prerequisites
Ensure you have `geniml` and `gtars` installed. `gtars` is our companion library for genomic interval data processing -- it contains the tokenizers. `geniml` provides the Atacformer model.
```bash
pip install geniml[ml] datasets gtars
```
Next, ensure you have a universe file that defines the genomic regions you want to tokenize. This file is typically in BED format.
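A universe file is plain BED: one region per line, tab-separated. The coordinates below are purely illustrative:

```
chr1	10000	10600
chr1	28700	29500
chr2	45200	45900
```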
Finally, you need your data in the `AnnData` format or as `.fragments.tsv.gz` files. If you have your data in a different format, you may need to convert it first. If using `AnnData`, please ensure that you have `chr`, `start`, and `end` columns in the `.var` dataframe.
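If your region coordinates live in the var names rather than in separate columns, you can split them out first. A minimal sketch, assuming var names of the form `chr1:1000-2000` (adjust the pattern to your naming scheme):

```python
import scanpy as sc

adata = sc.read_h5ad("path/to/your/anndata.h5ad")

# split var names like "chr1:1000-2000" into chr/start/end columns in .var
coords = adata.var_names.str.extract(r"^(\w+):(\d+)-(\d+)$")
adata.var["chr"] = coords[0].values
adata.var["start"] = coords[1].astype(int).values
adata.var["end"] = coords[2].astype(int).values
```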
## Tokenization Process
To pre-tokenize your data, use the `gtars` tokenizer. It reads your genomic interval data and converts it into the Atacformer input format:
```python
import scanpy as sc
import polars as pl

from geniml.tokenization import tokenize_anndata
from gtars.tokenizers import Tokenizer

# read in the data and create a tokenizer from a pre-trained model
adata = sc.read("path/to/your/anndata.h5ad")
tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")

# tokenize the data; each cell yields a dict with an "input_ids" entry
tokens = tokenize_anndata(adata, tokenizer)
input_ids = [t["input_ids"] for t in tokens]

# optional: truncate each cell to the model's context size
CONTEXT_SIZE = 8192
input_ids = [ids[:CONTEXT_SIZE] for ids in input_ids]

# collect into a DataFrame and write to Parquet
df = pl.DataFrame({
    "input_ids": input_ids,
})
df.write_parquet("path/to/tokenized_data.parquet")
```
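As an optional sanity check, you can read the file back and confirm that there is one row of input IDs per cell:

```python
import polars as pl

# read the tokenized data back and inspect it
df = pl.read_parquet("path/to/tokenized_data.parquet")
print(df.shape)  # one row per cell
print(df.head())
```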