
GlobalRefgetStore Python Reference

This Python reference gives quick examples of the GlobalRefgetStore class from the gtars.refget module. For the underlying RefgetStore file format specification, see refget-store-format.md.

Creating and Populating a Store

from gtars.refget import GlobalRefgetStore, StorageMode, digest_fasta

# Create a new store in Encoded mode (space-efficient)
store = GlobalRefgetStore(StorageMode.Encoded)
print(f"Initialized store: {store}")

# Add sequences from a FASTA file
store.add_sequence_collection_from_fasta("genome.fa")

# Inspect what's in the store
sequence_records = store.sequence_records()
sequence_metadata = store.sequence_metadata()
collections = store.collections()

# Access individual sequences
first_seq = sequence_records[0]
print(f"First sequence: {first_seq.metadata.name}")

# Decode sequence data to string
if first_seq.sequence:
    decoded = first_seq.decode()
    print(f"Sequence: {decoded}")

Saving and Loading Local Stores

# Save the store to disk
store_path = "my_refget_store"
store.write_store_to_dir(store_path, "sequences/%s2/%s.seq")

# Load a local store
loaded_store = GlobalRefgetStore.load_local(store_path)
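
The second argument to write_store_to_dir is a path template for the per-sequence files. The sketch below shows one plausible way such a template could expand. It assumes that `%s2` stands for the first two characters of a sequence digest and `%s` for the full digest; that reading is an assumption here, and refget-store-format.md is the authoritative source. `expand_path_template` is a hypothetical helper, not part of gtars.

```python
def expand_path_template(template: str, digest: str) -> str:
    """Expand a store path template for one sequence digest.

    Assumption: "%s2" means the first two characters of the digest and
    "%s" means the whole digest. This helper is illustrative only.
    """
    # Replace the longer placeholder first so "%s2" is not consumed by "%s".
    return template.replace("%s2", digest[:2]).replace("%s", digest)


# Example with a made-up, shortened digest string
print(expand_path_template("sequences/%s2/%s.seq", "aKF498dA"))
```

Under that assumption, sequence files are spread across subdirectories keyed by digest prefix, which keeps any single directory from holding every sequence file.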

Loading Remote Stores with Caching

You can load stores from remote URLs (HTTP/HTTPS) with local caching:

# Load from a remote server with local caching
cache_dir = "local_cache"
remote_url = "https://refget-server.example.com/hg38"
remote_store = GlobalRefgetStore.load_remote(cache_dir, remote_url)

# Get sequence metadata
seq_metadata = list(remote_store.sequence_metadata())
first_seq = seq_metadata[0]

# Get a substring (automatically fetches and caches data)
substring = remote_store.get_substring(first_seq.sha512t24u, 0, 1000)
print(f"First 50 of 1000 bases: {substring[:50]}...")

# Iterate over all sequences in the store
for seq_meta in remote_store:
    print(f"{seq_meta.name}: {seq_meta.length} bp")
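
The sha512t24u attribute used above is named after the GA4GH truncated digest: a SHA-512 hash cut to its first 24 bytes and base64url-encoded, which gives a 32-character string. The following is a minimal standard-library sketch of that algorithm, not the gtars implementation. Uppercasing the sequence before hashing is an assumption about normalization, and the function name simply mirrors the attribute.

```python
import base64
import hashlib


def sha512t24u(sequence: str) -> str:
    """Return the GA4GH-style truncated digest of a sequence string."""
    # Assumption: the digest is computed over the uppercase sequence.
    data = sequence.upper().encode("ascii")
    full_digest = hashlib.sha512(data).digest()
    # Keep the first 24 bytes; base64url turns them into 32 characters
    # with no padding, since 24 is a multiple of 3.
    return base64.urlsafe_b64encode(full_digest[:24]).decode("ascii")


print(sha512t24u("ACGT"))
```

A digest computed this way can be compared against the sha512t24u values reported by sequence_metadata() to check that a local sequence matches one in the store, provided the normalization assumption holds.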

Working with Collections

# Get collections in the store
collections = store.collections()
collection = collections[0]

# Get a sequence by collection and name
record = store.get_sequence_by_collection_and_name(
    collection.digest,
    "chr1"
)

# Export entire collection to FASTA
store.export_fasta(
    collection.digest,
    "output.fa",
    sequence_names=None,  # None = all sequences
    line_width=80
)

# Export specific sequences from a collection
store.export_fasta(
    collection.digest,
    "chr1_and_chr2.fa",
    sequence_names=["chr1", "chr2"],
    line_width=80
)
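
To illustrate what the line_width argument controls, here is a small sketch of FASTA line wrapping in plain Python. `format_fasta_record` is a hypothetical helper written for this example; the exact header and newline conventions that export_fasta uses are not shown in this guide, so treat the output format as an assumption.

```python
def format_fasta_record(name: str, sequence: str, line_width: int = 80) -> str:
    """Format one sequence as a FASTA record wrapped at line_width bases."""
    lines = [f">{name}"]
    # Slice the sequence into consecutive chunks of at most line_width bases.
    lines.extend(
        sequence[i:i + line_width] for i in range(0, len(sequence), line_width)
    )
    return "\n".join(lines) + "\n"


# A 20-base sequence wrapped at 8 bases per line
print(format_fasta_record("chr1", "ACGT" * 5, line_width=8), end="")
```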

Extracting Regions from BED Files

# Get sequences for regions defined in a BED file
retrieved_seqs = store.substrings_from_regions(
    collection.digest,
    "regions.bed"
)

for seq in retrieved_seqs:
    print(f"{seq.chrom_name}:{seq.start}-{seq.end} = {seq.sequence}")

# Export BED regions to a FASTA file
store.export_fasta_from_regions(
    collection.digest,
    "regions.bed",
    "output_regions.fa"
)
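
BED files use 0-based, half-open coordinates, so a region with start 0 and end 4 covers the first four bases. The sketch below shows that mapping on in-memory strings, without a store. `Region`, `parse_bed`, and `extract_regions` are hypothetical names for illustration, and it is an assumption that substrings_from_regions follows the same convention.

```python
from typing import Dict, Iterator, List, NamedTuple, Tuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def parse_bed(text: str) -> Iterator[Region]:
    """Yield regions from BED text, using only the first three columns."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield Region(chrom, int(start), int(end))


def extract_regions(sequences: Dict[str, str], bed_text: str) -> List[Tuple[Region, str]]:
    """Pair each BED region with the matching slice of its sequence."""
    return [
        (region, sequences[region.chrom][region.start:region.end])
        for region in parse_bed(bed_text)
    ]


sequences = {"chr1": "ACGTACGTAC"}
for region, bases in extract_regions(sequences, "chr1\t0\t4\nchr1\t2\t6\n"):
    print(f"{region.chrom}:{region.start}-{region.end} = {bases}")
```

Because the end coordinate is exclusive, the length of each extracted string equals end minus start, the same relationship as Python slicing.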

Local HTTP Server Example

For testing remote loading locally, you can serve a store directory:

# Run from the directory that contains my_refget_store/
python -m http.server 8200

Then connect to it:

remote_store = GlobalRefgetStore.load_remote(
    "local_cache",
    "http://localhost:8200/my_refget_store/"
)

# Use it like any other store
seq_digest = list(remote_store.sequence_metadata())[0].sha512t24u
substring = remote_store.get_substring(seq_digest, 0, 100)
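
In a test script it can be convenient to start the same static file server from Python instead of a separate shell. The sketch below uses only the standard library. `serve_directory` is a hypothetical helper, and binding to 127.0.0.1 on port 8200 simply mirrors the command above.

```python
import functools
import http.server
import threading


def serve_directory(directory: str, port: int = 8200) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP from a background thread.

    Pass port=0 to let the OS choose a free port; the chosen port is
    then available as server.server_address[1].
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread: the server does not keep the interpreter alive at exit.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

After `server = serve_directory(".")`, the store should be reachable at http://localhost:8200/my_refget_store/ as in the load_remote call above. Call `server.shutdown()` and `server.server_close()` when finished.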

More Information

For a comprehensive tutorial with detailed examples, see refgetstore.ipynb.

For the full API documentation, visit the gtars repository.