Module overviews

geniml is organized into modules. Each module groups together related tasks. This document provides an overview of each module.

Module assess-universe

Many genomic interval analysis methods, particularly those used by geniml, require that regions be redefined in terms of a consensus region set, or universe. However, a universe may not be a good fit for a given collection of files. Given a collection of genomic interval sets and a proposed universe, this module assesses how well the universe fits those sets, and it provides several complementary methods for doing so.

Module bbclient

The bbclient module can download BED files or BED sets from BEDbase and cache them into local folders.

Module bedspace

The bedspace module uses the StarSpace method (Wu et al., 2018) to jointly embed genomic region sets and their associated metadata into a shared latent embedding space. This facilitates fast search and retrieval of similar region sets and their associated metadata.

Module build-universe

This module provides multiple ways to build a genomic region universe. These include:

1. HMM: uses a hidden Markov model (HMM) to create a flexible segment universe from several input BED files.

Module evaluation

Once a geniml region embedding model is trained, we may want to evaluate the embeddings. The evaluation module provides several functions for that. These include statistical tests, such as the Cluster Tendency Test (CTT) and the Reconstruction Test (RCT), and biological tests, such as the Genome Distance Scaling Test (GDST) and the Neighborhood Preserving Test (NPT). These evaluation metrics can help you determine whether your models are working well and guide the tuning of training parameters.

Module region2vec

Region2Vec is an unsupervised method for creating embeddings for genomic regions and region sets from a set of raw BED files. The program uses a variation of the word2vec algorithm that builds shuffled context windows from BED files. The co-occurrence statistics of genomic regions in a collection of BED files allow the model to learn region embeddings.
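To make the idea of shuffled context windows concrete, here is a minimal sketch in plain Python. It is not the Region2Vec implementation: the function names, the region string format, and the assumption that each BED file is treated as one shuffled "sentence" per round are all illustrative. It only shows how shuffling removes genomic order, so that every pair of regions from the same file can end up in the same window.

```python
import random


def shuffled_contexts(region_sets, num_shuffles=2, seed=0):
    """Return one shuffled copy of every region set per shuffle round.

    Hypothetical helper: each region set stands in for the tokenized
    regions of one BED file.
    """
    rng = random.Random(seed)
    contexts = []
    for _ in range(num_shuffles):
        for regions in region_sets:
            shuffled = list(regions)  # copy so the input is not modified
            rng.shuffle(shuffled)
            contexts.append(shuffled)
    return contexts


def cooccurring_pairs(context, window=2):
    """Return (center, neighbor) pairs that lie within `window` positions."""
    pairs = []
    for i, center in enumerate(context):
        lo = max(0, i - window)
        hi = min(len(context), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, context[j]))
    return pairs


# Two toy "BED files", each already reduced to a list of region identifiers.
bed_files = [
    ["chr1:100-200", "chr1:500-650", "chr2:40-90"],
    ["chr1:100-200", "chr2:40-90"],
]

contexts = shuffled_contexts(bed_files, num_shuffles=2, seed=0)
pairs = [p for ctx in contexts for p in cooccurring_pairs(ctx, window=1)]
print(len(contexts), "contexts,", len(pairs), "training pairs")
```

Pairs like these are the kind of co-occurrence signal a word2vec-style model is trained on; a real training run would pass the shuffled contexts to a word2vec implementation rather than enumerate the pairs by hand.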

Module scembed

scEmbed is a single-cell implementation of Region2Vec: a method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. scEmbed allows for dimensionality reduction and feature selection of single-cell ATAC-seq data, a notoriously sparse and high-dimensional data type. We intend for scEmbed to be used with the scanpy package. As such, it natively accepts AnnData objects as input and returns AnnData objects as output.

Module search

The search module provides a generic interface for vector search. Several geniml modules (such as region2vec) create embeddings for different entities. The search module provides interfaces that store vectors and perform fast k-nearest neighbors (KNN) search with a given query vector. Backend options include a database backend (using qdrant-client) and a local file backend (using hnswlib).
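The store-then-query pattern described above can be sketched with a brute-force index in plain Python. This is only an illustration of KNN search over stored vectors: the class name `BruteForceIndex` and its methods are hypothetical and are not the geniml search API, and an exhaustive cosine-similarity scan stands in for the approximate indexes that hnswlib and qdrant provide.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class BruteForceIndex:
    """Hypothetical in-memory index: stores labeled vectors, scans all on query."""

    def __init__(self):
        self._items = []  # list of (label, vector) tuples

    def add(self, label, vector):
        self._items.append((label, list(vector)))

    def search(self, query, k=3):
        """Return the k most similar (label, score) pairs, best first."""
        scored = [(label, cosine_similarity(query, vec)) for label, vec in self._items]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]


index = BruteForceIndex()
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
index.add("c", [0.9, 0.1])

results = index.search([1.0, 0.0], k=2)
for label, score in results:
    print(label, round(score, 4))
```

An exhaustive scan costs one similarity computation per stored vector on every query, which is why dedicated backends are used once the collection grows large.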

Module text2bednn

Vec2Vec is a feedforward neural network that maps vectors from the embedding space of natural language (such as embeddings created by fastembed) to the embedding space of BED (such as embeddings created by Region2Vec). By mapping the embedding of natural language query strings to the space of BED files, Vec2Vec can perform natural language search of BED files.

Module tokenization

In NLP, training word embeddings requires first tokenizing words such that words in different forms are represented by one word. For example, "orange", "oranges" and "Orange" are all mapped to "orange" since they essentially convey the same meaning. This reduces the vocabulary size and improves the quality of learned embeddings. Similarly, many geniml modules (such as region2vec) require first tokenizing regions.

To tokenize regions, we need to provide a universe, which specifies the "vocabulary" of genomic regions. The universe is a BED file containing representative regions. Given a universe, tokenization represents each raw region by a region in the universe.

Different strategies can be used to tokenize. The simplest case, which we call hard tokenization, works as follows: if the overlap between a raw region in a BED file and a region in the universe exceeds a certain amount, then we use the region in the universe to represent this raw region; otherwise, we ignore this raw region. This is a "zero or one" process. After hard tokenization, each BED file will contain only regions from the universe, and the number of regions will be less than or equal to the original number.
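The hard tokenization rule can be sketched in a few lines of Python. This is a minimal illustration, not the geniml tokenizer: the function names are hypothetical, regions are assumed to be half-open `(chrom, start, end)` tuples, and the "certain amount" of overlap is assumed here to be a fraction of the raw region's length (`min_fraction`), with each raw region mapped to the universe region it overlaps most.

```python
def overlap_bp(a, b):
    """Overlapping base pairs between two half-open (chrom, start, end) intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def hard_tokenize(regions, universe, min_fraction=0.5):
    """Map each raw region to its best-overlapping universe region, or drop it.

    A raw region is kept only if the overlap covers more than `min_fraction`
    of its own length (an assumed threshold definition).
    """
    tokens = []
    for region in regions:
        length = region[2] - region[1]
        best, best_overlap = None, 0
        for candidate in universe:
            ov = overlap_bp(region, candidate)
            if ov > best_overlap:
                best, best_overlap = candidate, ov
        if best is not None and length > 0 and best_overlap / length > min_fraction:
            tokens.append(best)  # "one": represented by a universe region
        # otherwise "zero": the raw region is ignored
    return tokens


universe = [("chr1", 100, 200), ("chr1", 500, 650)]
raw_regions = [
    ("chr1", 120, 210),  # 80 of 90 bp overlap the first universe region: kept
    ("chr1", 190, 400),  # only 10 of 210 bp overlap: dropped
    ("chr2", 100, 200),  # no universe region on chr2: dropped
]

tokens = hard_tokenize(raw_regions, universe, min_fraction=0.5)
print(tokens)  # [('chr1', 100, 200)]
```

Because each raw region yields at most one token, the output can never be longer than the input. The nested loop compares every raw region with every universe region, which is fine for a sketch; a practical tokenizer would use sorted intervals or an interval tree instead.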