Generating a UMAP from fragments files

One of the most common single-cell ATAC-seq data formats is the fragments file format from 10x Genomics. Each line of a fragments file records one sequenced fragment: its genomic coordinates and the barcode of the cell it came from. Together, these fragments describe which genomic regions were accessible in each cell during the ATAC-seq experiment.
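To make the format concrete, here is a minimal sketch that counts fragments per barcode using only the standard library. It assumes the usual 10x layout (tab-separated chrom, start, end, barcode, read support, with `#` header comment lines). The sample records and the helper name `count_fragments_per_barcode` are illustrative and not part of any library.

```python
import io
from collections import Counter

# Toy stand-in for a fragments file (made-up records for illustration).
# Columns: chrom, start, end, barcode, read support. '#' lines are header comments.
SAMPLE = (
    "# id=example\n"
    "chr1\t10073\t10208\tAAACAGCCAAGGAATC-1\t2\n"
    "chr1\t10085\t10340\tAAACAGCCAATCCCTT-1\t1\n"
    "chr2\t181021\t181418\tAAACAGCCAAGGAATC-1\t1\n"
)


def count_fragments_per_barcode(lines):
    """Count fragment records per cell barcode from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        chrom, start, end, barcode = line.rstrip("\n").split("\t")[:4]
        counts[barcode] += 1
    return counts


counts = count_fragments_per_barcode(io.StringIO(SAMPLE))
print(counts.most_common())
```

For a real gzipped file, the same function should work on a text-mode handle such as `gzip.open(path, "rt")`.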

Getting the data

To start, let's grab an example fragments file:

wget "https://cf.10xgenomics.com/samples/cell-arc/2.0.0/human_brain_3k/human_brain_3k_atac_fragments.tsv.gz" -O human_brain_3k_atac_fragments.tsv.gz

Tokenize first

Remember that we always tokenize first and infer afterwards. We will start by tokenizing the fragments file:

from gtars.tokenizers import tokenize_fragments_file, Tokenizer

tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
tokens = tokenize_fragments_file("human_brain_3k_atac_fragments.tsv.gz", tokenizer)

tokens is now a dictionary in which each key is a unique cell barcode and each value is the list of input_ids for that cell.

Basic QC

Before inferring, let's remove cells with very low or very high fragment counts. The filter below uses the number of tokens per cell as a proxy for its fragment count:

min_fragments = 200
max_fragments = 10_000

filtered_tokens = {k: v for k, v in tokens.items() if min_fragments <= len(v) <= max_fragments}
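The thresholds above are a starting point, and it can help to look at the distribution of tokens per cell before committing to them. The following is a minimal sketch on toy data, assuming the barcode-to-input_ids dictionary described earlier. The helper names `summarize_token_counts` and `filter_cells`, and the toy barcodes, are hypothetical.

```python
import statistics


def summarize_token_counts(tokens):
    """Return (min, median, max) tokens per cell for a barcode -> input_ids dict."""
    lengths = [len(ids) for ids in tokens.values()]
    return min(lengths), statistics.median(lengths), max(lengths)


def filter_cells(tokens, min_tokens, max_tokens):
    """Keep cells whose token count lies in [min_tokens, max_tokens]."""
    return {
        barcode: ids
        for barcode, ids in tokens.items()
        if min_tokens <= len(ids) <= max_tokens
    }


# Toy data: three hypothetical cells with 3, 50, and 500 tokens.
toy = {"cell_a": [1, 2, 3], "cell_b": [4] * 50, "cell_c": list(range(500))}

print(summarize_token_counts(toy))
kept = filter_cells(toy, min_tokens=10, max_tokens=100)
print(sorted(kept))
```

Comparing the summary before and after filtering is a quick way to see how many cells a given pair of thresholds discards.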

Infer embeddings

Now that we have tokenized the fragments file and performed basic QC, we can infer embeddings for the filtered tokens:

from gtars.models import Atacformer

model = Atacformer.from_pretrained("databio/atacformer-base-hg38")
embeddings = model.encode_tokenized_cells(
    input_ids=list(filtered_tokens.values()),
    batch_size=32
)

# detach embeddings from the computation graph and convert to numpy
embeddings = embeddings.detach().cpu().numpy()
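The embedding matrix itself carries no barcodes, so it is worth keeping the barcode order alongside it. Python dictionaries preserve insertion order, so the keys and values of filtered_tokens iterate in the same order. The sketch below assumes the model returns one row per cell in input order, which is an assumption worth verifying for your version of the library. The helper `pair_barcodes_with_rows` is hypothetical.

```python
def pair_barcodes_with_rows(filtered_tokens, rows):
    """Map each barcode to its embedding row, assuming rows follow dict order."""
    barcodes = list(filtered_tokens)  # same order as filtered_tokens.values()
    if len(barcodes) != len(rows):
        raise ValueError(
            f"expected {len(barcodes)} rows, got {len(rows)}"
        )
    return dict(zip(barcodes, rows))


# Toy stand-ins: two hypothetical cells and two 2-dimensional "embeddings".
toy_tokens = {"cell_a": [1, 2], "cell_b": [3]}
toy_rows = [[0.1, 0.2], [0.3, 0.4]]

by_barcode = pair_barcodes_with_rows(toy_tokens, toy_rows)
print(by_barcode["cell_b"])
```

The same function accepts a 2D NumPy array for `rows`, since it only relies on `len` and row-wise iteration.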

Generate UMAP

Now that we have the embeddings, we can generate a UMAP:

from umap import UMAP
import matplotlib.pyplot as plt

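# project the embeddings to 2D; random_state fixes the seed for reproducibility
reducer = UMAP(n_components=2, random_state=42)
umap_embeddings = reducer.fit_transform(embeddings)

# draw one point per cell
plt.scatter(umap_embeddings[:, 0], umap_embeddings[:, 1], s=2)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.show()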