Example Brain3k Processing
This tutorial demonstrates how to process a small dataset from 10X Genomics, the Brain3k multiome dataset, using Atacformer.
To start, we need to download the matrix files:
wget "https://cf.10xgenomics.com/samples/cell-arc/2.0.0/human_brain_3k/human_brain_3k_filtered_feature_bc_matrix.tar.gz" -O "brain3k.tar.gz"
tar -xzf brain3k.tar.gz
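If you want to confirm what was unpacked before moving on, a quick listing helps. This is just an optional sanity-check sketch; it assumes the archive unpacks somewhere under the current working directory (the exact subdirectory name may vary), and it will also list the downloaded archive itself.
from pathlib import Path

# list the gzipped files that were downloaded/extracted
for path in sorted(Path(".").rglob("*.gz")):
    print(path)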
There are three files in the extracted directory: barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz. These contain the cell barcodes, the feature annotations, and the sparse count matrix, respectively. We will use them to build an AnnData object, a common format for storing single-cell data. The AnnData object will carry the genomic regions of interest in its .var dataframe:
import pandas as pd
from anndata import AnnData
from scipy.io import mmread

# read in the data (paths assume the extracted files sit in the current working directory);
# the matrix is stored as features x barcodes, so transpose it to cells x features for AnnData
mtx = mmread("matrix.mtx.gz").T.tocsr()
barcodes = pd.read_csv("barcodes.tsv.gz", header=None, names=["barcode"], sep="\t")
features = pd.read_csv("features.tsv.gz", header=None, names=["feature_id", "feature_name", "type", "chr", "start", "end"], sep="\t")

# create the AnnData object (cells indexed by barcode, features by id), then write it to disk
adata = AnnData(
    X=mtx,
    obs=barcodes.set_index("barcode"),
    var=features.set_index("feature_id"),
)
adata.write_h5ad("brain3k.h5ad")
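As a quick sanity check (not part of the processing itself), you can inspect the assembled object and count how many features of each type it contains. This assumes the standard 10X multiome feature types, "Gene Expression" and "Peaks", in the "type" column we defined above.
# inspect the object and the feature-type breakdown
print(adata)
print(adata.var["type"].value_counts())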
Because this is a multiome dataset, the object holds both gene expression and ATAC features, so we subset it to keep only the ATAC-seq peak regions:
# keep only the ATAC peaks; copy so the subset is an independent object rather than a view
atac = adata[:, adata.var["type"] == "Peaks"].copy()
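A one-line check makes the effect explicit: the cells are unchanged, only the peak features remain.
# cells stay the same, features drop to the ATAC peaks only
print(adata.shape, "->", atac.shape)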
Next, we will pre-tokenize the data using the Atacformer tokenizer. This step converts the genomic regions into input IDs that the Atacformer model can understand.
from geniml.atacformer import AtacformerForCellClustering
from geniml.tokenization import tokenize_anndata
from gtars.tokenizers import Tokenizer

# load the pretrained Atacformer model and its matching tokenizer
model = AtacformerForCellClustering.from_pretrained("databio/atacformer-base-hg38")
tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")

# tokenize each cell's accessible regions, then keep only the input IDs
tokens = tokenize_anndata(atac, tokenizer)
tokens = [t["input_ids"] for t in tokens]
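Before encoding, it can help to look at what the tokenizer produced. This sketch simply assumes tokens is now a list with one list of integer input IDs per cell, as constructed above; the number of tokens per cell varies with how many regions are accessible in that cell.
# one list of input IDs per cell
print(len(tokens), "cells tokenized")
print("first cell has", len(tokens[0]), "tokens:", tokens[0][:10])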
Now we can generate single-cell embeddings and attach them back to the AnnData object. This lets us use the embeddings for downstream tasks such as clustering and visualization.
# embed each tokenized cell, then attach the embeddings to the AnnData object
embeddings = model.encode_tokenized_cells(tokens)
atac.obsm["X_atacformer"] = embeddings.detach().cpu().numpy()
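A quick look at the stored matrix confirms there is one embedding vector per cell; the embedding dimensionality depends on the pretrained checkpoint, so it is printed rather than assumed.
# one row per cell, one column per embedding dimension
n_cells, n_dims = atac.obsm["X_atacformer"].shape
print(f"{n_cells} cells embedded into {n_dims} dimensions")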
Finally, we can generate clusters and visualize the results using UMAP.
import scanpy as sc

# build the neighbor graph on the Atacformer embeddings, cluster with Leiden,
# compute the UMAP layout, and plot the clusters
sc.pp.neighbors(atac, use_rep="X_atacformer")
sc.tl.leiden(atac)
sc.tl.umap(atac)
sc.pl.umap(atac, color="leiden", frameon=False)
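It is worth persisting the annotated object so the embeddings, Leiden labels, and UMAP coordinates can be reloaded later without recomputing them; the output filename here is just a suggestion.
# save the ATAC subset together with its embeddings, clusters, and UMAP coordinates
atac.write_h5ad("brain3k_atac_atacformer.h5ad")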