How to evaluate genomic region embeddings
Preparation
Create a Base Embedding Object
Given a set of genomic region embeddings embeddings
and the corresponding regions vocab
, use BaseEmbeddings
to create an base
embedding object.
from geniml.eval.utils import BaseEmbeddings
import pickle
base_obj = BaseEmbeddings(embeddings, vocab)
with open("base_embed.pt", "wb") as f:
pickle.dump(base_obj, f)
Generate Binary Embeddings
from geniml.eval.utils import get_bin_embeddings
universe_file = "/path/to/universe.bed"
token_files = ["file1.bed", "file2.bed"]
bin_embed = get_bin_embeddings(universe_file, token_files)
Or use command line:
geniml eval bin-gen --universe /path/to/universe.bed --token-folder /path/to/tokenized/folder --file-name bin_embed.pickle
Statistical Tests
Cluster Tendency Test (CTT)
CTT analyzes how well a set of region embeddings can be clustered. CTT score lies between 0 and 1. A larger CTT score indicates a greater tendency for the embeddings being evaluated to have clusters. When the embeddings are uniformly distributed, the score is 0.5. For evenly spaced embeddings, the score approaches 0.
from geniml.eval.ctt import get_ctt_score, ctt_eval
path = "/path/to/a/region2vec/model/"
embed_type = "region2vec"
ctt_score = get_ctt_score(path, embed_type, seed=42, num_data=10000, num_workers=10)
print(ctt_score)
# evaluate a batch of models and run CTT for 5 times with different random seeds
batch = [(path, embed_type)]
ctt_score_arr = ctt_eval(batch, num_runs=5, num_data=10000, num_workers=10)
print(f"Model: {ctt_score_arr[0][0]}\n CTT scores:{ctt_score_arr[0][1]}") # CTT scores for the 1st model in the batch
Or use the command line
geniml eval ctt --model-path /path/to/a/region2vec/model/ --embed-type region2vec
Reconstruction Test (RCT)
RCT evaluates how well an embedding of a region preserves the region’s occurrence information in the training data. The best RCT score is 1.
from geniml.eval.rct import get_rct_score, rct_eval
path = "/path/to/a/region2vec/model/"
embed_type = "region2vec"
bin_path = "/path/to/a/binary/embedding/for/the/same/tokenized/files/"
# set out_dim to -1 use all the dimensions of the binary embeddings. Set out_dim to a small positive number to reduce computational complexity.
rct_score = get_rct_score(path, embed_type, bin_path, out_dim=-1, cv_num=5, seed=42, num_workers=10)
print(rct_score)
# evaluate a batch of models and run RCT for 5 times with different random seeds
batch = [(path, embed_type, bin_path)]
rct_score_arr = rct_eval(batch, num_runs=5, cv_num=5, out_dim=-1, num_workers=10)
print(f"Model: {rct_score_arr[0][0]}\n RCT scores:{rct_score_arr[0][1]}") # RCT scores for the 1st model in the batch
Or use the command line
geniml eval rct --model-path /path/to/a/region2vec/model/ --embed-type region2vec
get_rct_score
in geniml/eval/rct.py
and change the constructor of MLPRegressor
.
Biological Tests
Genome Distance Scaling Test (GDST)
GDST calculates a score measuring how much the embedding distance between two regions scales the corresponding genome distance.
from geniml.eval.gdst import get_gdst_score, gdst_eval
path = "/path/to/a/region2vec/model/"
embed_type = "region2vec"
gdst_score = get_gdst_score(path, embed_type, num_samples=10000, seed=42)
print(gdst_score)
# evaluate a batch of models and run GDST for 5 times with different random seeds
batch = [(path, embed_type)]
gdst_score_arr = gdst_eval(batch, num_runs=5, num_samples=10000)
Or use the command line
geniml eval gdst --model-path /path/to/a/region2vec/model/ --embed-type region2vec
Neighborhood Preserving Test (NPT)
NPT evaluates how significant genomic region embeddings preserve their neighboring regions on the genome against random embeddings. The code output the NPT score for a set of region embeddings.
from geniml.eval.npt import get_npt_score, npt_eval
path = "/path/to/a/region2vec/model/"
embed_type = "region2vec"
K = 10
# If resolution = K gives NPT for K neighbors
# If resolution < K, gives NPT for [resolution, resolution*2, ...] neighbors
resolution = K
npt_score = get_npt_score(path, embed_type, K, num_samples=100, seed=0, resolution=resolution, num_workers=10)
print(npt_score['SNPR'])
# evaluate a batch of models and run NPT for 5 times with different random seeds
batch = [(path, embed_type)]
npt_score_arr = npt_eval(batch, K, num_samples=100, num_workers=10, num_runs=5, resolution=resolution)
print(f"Model: {npt_score_arr[0][0]}\n NPT scores: {npt_score_arr[0][1]}") # NPT scores for the 1st model in the batch
Or use the command line (the output will be the result when resolution=K)
geniml eval npt --model-path /path/to/a/region2vec/model/ --embed-type region2vec --K 50 --num-samples 1000