How to load a vector database with cell embeddings
Overview
In this tutorial, we will show how to load a vector database with cell embeddings. There are many benefits to storing cell-embeddings in a vector database: 1. Speed: Loading a vector database is much faster than re-encoding cells. 2. Reproducibility: You can share your cell embeddings with others. 3. Flexibility: You can use the same cell embeddings for many different analyses. 4. Interoperability: You can use the same cell embeddings with many different tools.
In a subsequent tutorial, we will show how to use a vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.
Preqrequisites
There are two core components to this tutorial: 1) the pre-trained model, and 2) the vector database.
Pre-trained model:
I will be using the databio/luecken2021
model. It was trained on the Luecken2021 dataset, a first-of-its-kind multimodal benchmark dataset of 120,000 single cells from the human bone marrow of 10 diverse donors measured with two commercially-available multi-modal technologies: nuclear GEX with joint ATAC, and cellular GEX with joint ADT profiles.
Vector database:
Vector databases are a new and exciting technology that allow you to store and query high-dimensional vectors very quickly. This tutorial will use the qdrant
vector database. As a lab, we really like qdrant
because it is fast, easy to use, and has a great API. You can learn more about qdrant
here. For qdrant
setup, please refer to the qdrant documentation. In the end, you should have a running qdrant
instance at http://localhost:6333
.
Data preparation
Grab a fresh copy of the Luecken2021 data from the geo accession. We want the multiome
data. This dataset contains the binary accessibility matrix, the peaks, and the barcodes. It also conveniently contains the cell-type labels. Pre-trained models also requires that the data be in a scanpy.AnnData
format and the .var
attribute contain chr
, start
, and end
values.
import scanpy as sc
adata = sc.read_h5ad("path/to/adata.h5ad")
adata = adata[:, adata.var['feature_types'] == 'ATAC']
Getting embeddings
We can easily get embeddings of the dataset using the pre-trained model:
import scanpy as sc
from geniml.scembed import ScEmbed
adata = sc.read_h5ad("path/to/adata.h5ad")
model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
embeddings = model.encode(adata)
adata.obsm['scembed_X'] = np.array(embeddings)
Loading the vector database
With the embeddings, we can now upsert them to qdrant
. Ensure you have qdrant_client
installed:
pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct
client = QdrantClient("localhost", port=6333)
client.create_collection(
collection_name="luecken2021",
vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.DOT),
)
embeddings, cell_types = adata.obsm['scembed_X'], adata.obs['cell_type']
points = []
for embedding, cell_type, i in zip(embeddings, cell_types, range(len(embeddings)):
points.append(
PointStruct(
id=adata.obs.index[i],
vector=embedding.tolist(),
payload={"cell_type": cell_type}
))
client.upsert(collection_name="luecken2021", points=points, wait=True)
You should now have a vector database with cell embeddings. In the next tutorial, we will show how to use this vector database to query cell embeddings and annotate cells with cell-type labels using a KNN classification algorithm.