Fine-tune Atacformer on your data

Often, you will want to fine-tune a pre-trained Atacformer model on your own data. This is standard transfer learning: you take a model that has been pre-trained on a large dataset and adapt it to your specific one.

Prerequisites

Before starting, ensure you have:

- The atacformer, geniml, transformers, datasets, and torch packages installed
- A pre-tokenized dataset saved in Parquet format (see below)
- A CUDA-capable GPU (bfloat16 training benefits from Ampere or newer hardware)

Setup training

We use a mix of the geniml/atacformer tooling and the Hugging Face transformers library to run training.

import os

import torch
from datasets import Dataset
from transformers import TrainingArguments, Trainer
from atacformer import (
    AtacformerForReplacedTokenDetection,
    DataCollatorForReplacedTokenDetection,
    TrainingTokenizer,
)

# allow TF32 matmuls for faster training on supported GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.set_float32_matmul_precision("medium")

# optional for experiment tracking
# os.environ["WANDB_PROJECT"] = "atacformer-fine-tuning"

MLM_PROBABILITY = 0.45        # fraction of tokens replaced for the detection objective
BATCH_SIZE = 32               # per-device batch size
MAX_LEARNING_RATE = 1.5e-4    # peak learning rate after warmup
RUN_NAME = "atacformer-fine-tuning"

MODEL_TO_FINE_TUNE = "databio/atacformer-base-hg38"

Load your dataset

Load your pre-tokenized dataset. Training expects the dataset in Parquet format with the following columns:

- input_ids: tokenized genomic regions

dataset_path = "path/to/dataset.parquet"
tokenized_dataset = Dataset.from_parquet(dataset_path)
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=42)
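If you do not already have a Parquet file in this layout, here is a minimal sketch of producing one with the datasets library, assuming your tokenized samples are already available as lists of integer token IDs (the values below are purely illustrative):

from datasets import Dataset

# toy example: each row holds the token IDs for one tokenized sample
tokenized_cells = {
    "input_ids": [
        [12, 873, 4051, 22],
        [7, 873, 19, 30512, 44],
    ]
}

Dataset.from_dict(tokenized_cells).to_parquet("path/to/dataset.parquet")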

Create the tokenizer and data collator

Next, set up the tokenizer. The tokenizer is loaded from the same pre-trained checkpoint, which ships with the universe file defining the genomic regions (the model's vocabulary). The data collator handles the replaced-token-detection objective, using MLM_PROBABILITY as the corruption rate:

tokenizer = TrainingTokenizer.from_pretrained(MODEL_TO_FINE_TUNE)
data_collator = DataCollatorForReplacedTokenDetection(
    tokenizer=tokenizer,
    mlm_probability=MLM_PROBABILITY,
)
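For intuition: replaced token detection corrupts a random subset of tokens in each sequence and trains the model to flag which positions were replaced. The snippet below is not the atacformer collator's actual implementation, just a plain-torch illustration of the labeling scheme in which replacements are drawn uniformly at random from a hypothetical vocab_size:

import torch

def illustrate_rtd(input_ids: torch.Tensor, vocab_size: int, replace_prob: float = 0.45):
    """Toy illustration of replaced-token-detection labels (not the atacformer implementation)."""
    mask = torch.rand(input_ids.shape) < replace_prob             # positions to corrupt
    random_tokens = torch.randint(0, vocab_size, input_ids.shape)
    corrupted = torch.where(mask, random_tokens, input_ids)       # swap in random tokens
    labels = mask.long()                                          # 1 = replaced, 0 = original
    return corrupted, labels

corrupted, labels = illustrate_rtd(torch.tensor([[12, 873, 4051, 22]]), vocab_size=10_000)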

Create the model

Grab the pre-trained model weights from the Hugging Face Hub:

model = AtacformerForReplacedTokenDetection.from_pretrained(MODEL_TO_FINE_TUNE)
model = model.to(torch.bfloat16)  # bfloat16 training is faster on Ampere (and newer) GPUs
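As an optional sanity check, you can confirm the weights loaded and see how large the model is (parameters() and numel() are standard PyTorch APIs):

n_params = sum(p.numel() for p in model.parameters())
print(f"Loaded {type(model).__name__} with {n_params / 1e6:.1f}M parameters")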

Training arguments

Set up the training arguments. You can adjust these based on your hardware and dataset size:

training_args = TrainingArguments(
    output_dir="atacformer-fine-tuning-output",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    eval_strategy="no",                       # no evaluation during training (see note below)
    logging_strategy="steps",
    logging_steps=10,
    run_name=RUN_NAME,
    warmup_steps=500,
    lr_scheduler_type="cosine_with_restarts",
    learning_rate=MAX_LEARNING_RATE,
    bf16=True,                                # mixed-precision training in bfloat16
    max_grad_norm=1.0,                        # gradient clipping
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
)
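Note that eval_strategy="no" disables evaluation during training even though an eval_dataset is provided; evaluation only happens if you call it yourself (see the end of this guide). If you prefer periodic evaluation on the held-out split, a sketch of the relevant TrainingArguments changes (standard transformers options; the step count is illustrative) is:

training_args = TrainingArguments(
    output_dir="atacformer-fine-tuning-output",
    eval_strategy="steps",   # evaluate every eval_steps optimizer steps
    eval_steps=500,
    per_device_eval_batch_size=BATCH_SIZE,
    # ... keep the remaining arguments from above
)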

Train your model

Finally, start the training process:

trainer.train()
model.save_pretrained("output/atacformer-fine-tuned")

Evaluate your model

Your model can now be used like any other Atacformer model. You can evaluate it on your test set or use it for downstream tasks such as cell clustering, classification, or regression.
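For a quick quantitative check, you can run the held-out split through the Trainer's standard evaluate() method, which reports the model's loss on the replaced-token-detection objective for your data:

metrics = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
print(metrics)  # e.g. eval_loss on the held-out split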