Geneformer Module

geneformer

Classes

GeneformerConfig

GeneformerConfig(vocab_size=20275, hidden_size=512, intermediate_size=1024, num_attention_heads=8, num_hidden_layers=12, attention_probs_dropout_prob=0.02, hidden_act='relu', hidden_dropout_prob=0.02, initializer_range=0.02, layer_norm_eps=1e-12, max_position_embeddings=4096, pad_token_id=0, classifier_dropout=None, **kwargs)

Bases: BertConfig

Configuration for the Geneformer model, a BERT-like transformer over gene tokens.
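Since GeneformerConfig subclasses BertConfig, it can be instantiated with the defaults above or with any field overridden by keyword. A minimal sketch (the `geneformer` import path is an assumption; adjust to wherever the class lives in your install):

```python
# Hypothetical import path -- adjust to the installed package layout.
from geneformer import GeneformerConfig

# Defaults match the 95M model series (4096-token context window).
config = GeneformerConfig()

# Any BertConfig field can be overridden by keyword, e.g. a smaller model:
small_config = GeneformerConfig(max_position_embeddings=2048, num_hidden_layers=6)
```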

GeneformerModel

GeneformerModel(config)

Bases: BertForMaskedLM

Geneformer Model with a masked language modeling head.

Functions
get_input_embeddings
get_input_embeddings()

Returns the input embeddings of the model.

set_input_embeddings
set_input_embeddings(value)

Sets the input embeddings of the model.

TranscriptomeTokenizer

TranscriptomeTokenizer(custom_attr_name_dict=None, nproc=1, chunk_size=512, model_input_size=4096, special_token=True, collapse_gene_ids=True, gene_median_file=None, token_dictionary_file=None, gene_mapping_file=None)

Initialize tokenizer.

Parameters:

custom_attr_name_dict : None | dict
    Dictionary of custom attributes to be added to the dataset. Keys are the names of the attributes in the loom file; values are the names of the attributes in the dataset.
nproc : int = 1
    Number of processes to use for dataset mapping.
chunk_size : int = 512
    Chunk size for the anndata tokenizer.
model_input_size : int = 4096
    Max input size of the model to truncate input to. For the 30M model series, should be 2048; for the 95M model series, should be 4096.
special_token : bool = True
    Adds a CLS token before and an EOS token after the rank value encoding. For the 30M model series, should be False; for the 95M model series, should be True.
collapse_gene_ids : bool = True
    Whether to collapse gene IDs based on the gene mapping dictionary.
gene_median_file : Path
    Path to pickle file containing a dictionary of non-zero median gene expression values across Genecorpus-30M.
token_dictionary_file : Path
    Path to pickle file containing the token dictionary (Ensembl ID : token).
gene_mapping_file : None | Path
    Path to pickle file containing a dictionary for collapsing gene IDs.
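The tokenizer's core transformation is rank value encoding: each cell's gene expression values are normalized by the per-gene non-zero medians (from gene_median_file), genes are sorted by normalized value in descending order, mapped to token IDs (from token_dictionary_file), truncated to model_input_size, and wrapped in CLS/EOS tokens when special_token=True. A simplified pure-Python sketch of that logic (the dictionaries and special-token names here are illustrative, not the shipped files):

```python
def rank_value_encode(expression, gene_medians, token_dict,
                      model_input_size=4096, special_token=True):
    """Simplified sketch of Geneformer-style rank value encoding.

    expression   : dict mapping Ensembl gene ID -> raw expression in one cell
    gene_medians : dict mapping gene ID -> non-zero median expression
    token_dict   : dict mapping gene ID (plus special tokens) -> token ID
    """
    # Normalize each expressed gene by its corpus-wide non-zero median.
    normalized = {
        gene: value / gene_medians[gene]
        for gene, value in expression.items()
        if value > 0 and gene in gene_medians and gene in token_dict
    }
    # Rank genes by normalized expression, highest first.
    ranked = sorted(normalized, key=normalized.get, reverse=True)
    tokens = [token_dict[gene] for gene in ranked]

    if special_token:
        # Reserve two positions for <cls>/<eos>, then truncate.
        tokens = [token_dict["<cls>"]] + tokens[:model_input_size - 2] + [token_dict["<eos>"]]
    else:
        tokens = tokens[:model_input_size]
    return tokens


# Toy example: three genes in one cell.
medians = {"ENSG_A": 2.0, "ENSG_B": 1.0, "ENSG_C": 4.0}
tokens = {"<pad>": 0, "<cls>": 1, "<eos>": 2, "ENSG_A": 10, "ENSG_B": 11, "ENSG_C": 12}
cell = {"ENSG_A": 4.0, "ENSG_B": 3.0, "ENSG_C": 4.0}
# Normalized values: A=2.0, B=3.0, C=1.0 -> rank order B, A, C.
print(rank_value_encode(cell, medians, tokens))  # [1, 11, 10, 12, 2]
```

The ranking (rather than the raw value) is what the model consumes, which is why the per-gene medians matter: they put housekeeping genes and lowly expressed genes on a comparable scale before sorting.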

Functions
tokenize_data
tokenize_data(data_directory, output_directory, output_prefix, file_format='loom', use_generator=False)

Tokenize .loom or .h5ad files in data_directory and save as tokenized .dataset in output_directory.

Parameters:

data_directory : Path
    Path to directory containing loom files or anndata files.
output_directory : Path
    Path to directory where tokenized data will be saved as .dataset.
output_prefix : str
    Prefix for output .dataset.
file_format : str
    Format of input files. Can be "loom" or "h5ad".
use_generator : bool
    Whether to use a generator or dict for tokenization.
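A typical end-to-end call might look as follows (the import path, directory names, and attribute mapping are illustrative, not part of the documented API surface beyond the signatures above):

```python
from geneformer import TranscriptomeTokenizer  # hypothetical import path

tk = TranscriptomeTokenizer(
    custom_attr_name_dict={"cell_type": "cell_type"},  # carry this loom attribute into the dataset
    nproc=4,
)
tk.tokenize_data(
    data_directory="data/",        # directory of .loom (or .h5ad) files
    output_directory="tokenized/", # .dataset output lands here
    output_prefix="my_corpus",
    file_format="loom",
)
```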