GlobalRefgetStore Tutorial¶
This notebook demonstrates how to use the GlobalRefgetStore class from the gtars.refget module for managing reference genome sequences using the GA4GH refget standard.
Features Covered¶
- Creating and populating a local RefgetStore
- Retrieving sequences by ID and collection name
- Getting substrings and BED file regions
- Extracting complete FASTA from a sequence collection
- Saving and loading stores
- Loading remote RefgetStores with lazy loading
- Working with cached remote sequences
Setup¶
First, import the necessary modules:
import os
import tempfile
from gtars.refget import GlobalRefgetStore, StorageMode, digest_fasta
Part 1: Local RefgetStore - Full Workflow¶
This section demonstrates the complete workflow for creating and using a local RefgetStore.
1.1 Create a Sample FASTA File¶
We'll create a temporary FASTA file with two chromosomes:
# Create a temporary directory for our example
temp_dir = tempfile.mkdtemp()
print(f"Working in temporary directory: {temp_dir}")
# Create a sample FASTA file
fasta_content = (
    ">chr1\n"
    "ATGCATGCATGCAGTCGTAGC\n"
    ">chr2\n"
    "GGGGAAAA\n"
)
source_fasta_path = os.path.join(temp_dir, "source.fa")
with open(source_fasta_path, "w") as f:
    f.write(fasta_content)
print(f"Created FASTA file: {source_fasta_path}")
1.2 Digest the FASTA to Get Collection Information¶
The digest_fasta function computes GA4GH-compliant digests for each sequence:
# Digest the FASTA to get collection info
collection = digest_fasta(source_fasta_path)
collection_digest = collection.digest
print(f"Collection digest: {collection_digest}")
print(f"Number of sequences: {len(collection)}")
print(f"\nSequences in collection:")
for seq in collection:
    print(f" - {seq.metadata.name}: {seq.metadata.sha512t24u}")
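For reference, the sha512t24u identifier printed above is defined by the GA4GH refget standard: the SHA-512 hash of the uppercased sequence, truncated to 24 bytes and base64url-encoded (24 bytes encode to exactly 32 characters, so no padding appears). A minimal standard-library sketch, independent of gtars:

```python
import base64
import hashlib

def sha512t24u(sequence: str) -> str:
    """GA4GH sha512t24u digest: SHA-512 of the uppercased sequence,
    truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(sequence.upper().encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")

# Digest of the tutorial's chr1 sequence
print(sha512t24u("ATGCATGCATGCAGTCGTAGC"))
```

Because the input is uppercased before hashing, the digest is case-insensitive with respect to the source FASTA.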
1.3 Initialize and Populate a GlobalRefgetStore¶
Create a store in Encoded mode for efficient storage:
# Initialize store in Encoded mode
store = GlobalRefgetStore(StorageMode.Encoded)
print(f"Initialized store: {store}")
# Import the FASTA file into the store
store.add_sequence_collection_from_fasta(source_fasta_path)
print("\nFASTA imported into the store.")
print(f"Store now contains: {store}")
1.4 Retrieve Sequences by ID¶
Get a complete sequence using its digest:
# Get the digest for chr1
seq_digest_chr1 = collection[0].metadata.sha512t24u
# Retrieve the sequence record
record_chr1 = store.get_sequence_by_id(seq_digest_chr1)
if record_chr1:
    print(f"Retrieved sequence: {record_chr1.metadata.name}")
    print(f" Length: {record_chr1.metadata.length}")
    print(f" Alphabet: {record_chr1.metadata.alphabet}")

    # Get the full sequence
    full_seq = store.get_substring(seq_digest_chr1, 0, record_chr1.metadata.length)
    print(f" Sequence: {full_seq}")
1.5 Get Substrings¶
Extract specific regions from sequences:
# Get a substring from chr1 (positions 5-15)
sub_seq = store.get_substring(seq_digest_chr1, 5, 15)
print(f"Substring chr1[5:15]: {sub_seq}")
# Get a substring from chr2
seq_digest_chr2 = collection[1].metadata.sha512t24u
sub_seq_chr2 = store.get_substring(seq_digest_chr2, 0, 4)
print(f"Substring chr2[0:4]: {sub_seq_chr2}")
1.6 Retrieve Regions from BED Files¶
Extract multiple regions specified in a BED file:
# Create a BED file with regions of interest
bed_content = (
    "chr1\t0\t10\n"
    "chr2\t2\t6\n"
    "chr_nonexistent\t0\t5\n"  # This will be skipped
)
bed_path = os.path.join(temp_dir, "regions.bed")
with open(bed_path, "w") as f:
    f.write(bed_content)
# Retrieve sequences as a list
retrieved_list = store.substrings_from_regions(collection_digest, bed_path)
print("Retrieved sequences from BED file:")
for rs in retrieved_list:
    print(f" {rs.chrom_name}[{rs.start}-{rs.end}]: {rs.sequence}")
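BED coordinates are 0-based and half-open, so each line maps directly onto a Python-style slice of the named sequence, and lines naming an unknown chromosome (like chr_nonexistent above) are skipped. A rough pure-Python illustration of that behavior, using a hypothetical helper rather than the gtars API:

```python
def regions_from_bed(bed_text, sequences):
    """Slice sequences by BED lines; skip unknown chromosomes.

    `sequences` maps chromosome name -> full sequence string.
    (Illustrative helper, not the gtars implementation.)
    """
    results = []
    for line in bed_text.splitlines():
        if not line.strip():
            continue
        chrom, start, end = line.split("\t")[:3]
        if chrom not in sequences:
            continue  # mirror the store's skipping of unknown names
        results.append((chrom, sequences[chrom][int(start):int(end)]))
    return results

seqs = {"chr1": "ATGCATGCATGCAGTCGTAGC", "chr2": "GGGGAAAA"}
bed = "chr1\t0\t10\nchr2\t2\t6\nchr_nonexistent\t0\t5\n"
for chrom, sub in regions_from_bed(bed, seqs):
    print(chrom, sub)  # chr1 ATGCATGCAT / chr2 GGAA
```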
1.7 Write Retrieved Sequences to FASTA¶
Save BED file regions as a new FASTA file:
# Write retrieved sequences to a FASTA file
output_fasta_path = os.path.join(temp_dir, "output_regions.fa")
store.export_fasta_from_regions(collection_digest, bed_path, output_fasta_path)
print(f"Sequences written to: {output_fasta_path}\n")
with open(output_fasta_path, "r") as f:
    print("Content:")
    print(f.read())
1.8 Extract Complete FASTA from a Sequence Collection¶
You can reconstruct the entire FASTA file from a sequence collection by creating a BED file that covers all chromosomes:
# Create a BED file covering all sequences in the collection
# Each line is: chrom_name 0 length (whole chromosome)
all_seqs_bed = os.path.join(temp_dir, "all_sequences.bed")
with open(all_seqs_bed, "w") as f:
    for seq in collection:
        f.write(f"{seq.metadata.name}\t0\t{seq.metadata.length}\n")
print("BED file covering all sequences:")
with open(all_seqs_bed, "r") as f:
    print(f.read())
# Extract complete FASTA from the collection
extracted_fasta_path = os.path.join(temp_dir, "extracted_complete.fa")
store.export_fasta_from_regions(collection_digest, all_seqs_bed, extracted_fasta_path)
print(f"\nExtracted complete FASTA to: {extracted_fasta_path}")
with open(extracted_fasta_path, "r") as f:
    print("\nContent:")
    print(f.read())
Alternative: You can also extract sequences programmatically without a BED file:
# Extract all sequences from a collection programmatically
manual_fasta_path = os.path.join(temp_dir, "manual_extraction.fa")
with open(manual_fasta_path, "w") as f:
    for seq in collection:
        # Get the sequence data
        seq_data = store.get_substring(
            seq.metadata.sha512t24u,
            0,
            seq.metadata.length
        )
        # Write FASTA header and sequence
        f.write(f">{seq.metadata.name}\n")
        f.write(f"{seq_data}\n")
print(f"Manually extracted FASTA to: {manual_fasta_path}")
with open(manual_fasta_path, "r") as f:
    print("\nContent:")
    print(f.read())
1.9 Save Store to Disk¶
Persist the store for later use:
# Save the store to a directory
saved_store_path = os.path.join(temp_dir, "my_refget_store")
store.write_store_to_dir(saved_store_path, "sequences/%s2/%s.seq")
print(f"Store saved to: {saved_store_path}")
print(f"\nStore structure:")
for root, dirs, files in os.walk(saved_store_path):
    level = root.replace(saved_store_path, '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = ' ' * 2 * (level + 1)
    for file in files:
        print(f"{subindent}{file}")
1.10 Load Store from Disk¶
Load a previously saved store (with lazy loading):
# Load the store using load_local method (with lazy loading)
loaded_store = GlobalRefgetStore.load_local(saved_store_path)
print(f"Store loaded from: {saved_store_path}")
print(f"Loaded store: {loaded_store}")
# Verify we can retrieve sequences (data loaded on-demand)
seq = loaded_store.get_substring(seq_digest_chr1, 0, 10)
print(f"\nRetrieved sequence from loaded store: {seq}")
Part 2: Remote RefgetStore with Lazy Loading¶
The load_remote() method allows you to work with RefgetStores hosted on remote servers without downloading everything upfront. Sequence data is fetched on-demand and cached locally.
2.1 Understanding Remote Stores¶
Key Concepts:
- Metadata loaded immediately: Index files (index.json, sequences.farg) are fetched from the remote
- Sequence data loaded on-demand: Actual .seq files are only downloaded when accessed
- Local caching: Downloaded sequences are cached to avoid re-fetching
- User-controlled cache: You specify where cached data is stored
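The fetch-on-demand pattern behind this design can be sketched generically: check the local cache first, download only on a miss, and serve every later request from disk. A simplified pure-Python illustration of the concept (not the gtars implementation):

```python
import os
import tempfile

class LazyFetcher:
    """Fetch-on-demand with a local file cache (illustrative sketch)."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch        # callable: key -> bytes (stands in for the network)
        self.fetch_count = 0      # how many "remote" fetches actually happened

    def get(self, key):
        path = os.path.join(self.cache_dir, key)
        if not os.path.exists(path):          # cache miss: download once
            self.fetch_count += 1
            with open(path, "wb") as f:
                f.write(self.fetch(key))
        with open(path, "rb") as f:           # cache hit from here on
            return f.read()

cache = tempfile.mkdtemp()
lazy = LazyFetcher(cache, fetch=lambda key: b"ATGC" * 4)
lazy.get("chr1.seq")      # first access: fetches and caches
lazy.get("chr1.seq")      # second access: served from the cache
print(lazy.fetch_count)   # 1
```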
2.2 Simulating a Remote Store¶
For this example, we'll use the store we created earlier and treat it as a "remote" store using a file:// URL. In practice, you'd use http:// or https:// URLs pointing to an actual remote server.
# Create a cache directory for remote data
cache_dir = os.path.join(temp_dir, "refget_cache")
os.makedirs(cache_dir, exist_ok=True)
print(f"Cache directory: {cache_dir}")
# Simulate a remote URL (in practice, this would be https://...)
remote_url = f"file://{saved_store_path}"
print(f"Remote URL: {remote_url}")
2.3 Load Remote Store¶
Load a store from a remote URL with local caching:
# Load remote store
# In a real scenario, remote_url would be something like:
# "https://s3.amazonaws.com/genomes/hg38" or "https://refget.example.com/store"
remote_store = GlobalRefgetStore.load_remote(cache_dir, remote_url)
print(f"Remote store loaded: {remote_store}")
print(f"\nCache directory contents (metadata only at this point):")
for root, dirs, files in os.walk(cache_dir):
    for file in files:
        rel_path = os.path.relpath(os.path.join(root, file), cache_dir)
        print(f" - {rel_path}")
2.4 Access Sequences (Triggers Lazy Loading)¶
When you request a sequence, it's automatically fetched and cached:
# First access - triggers download and caching
print("First access (will fetch from remote and cache):")
seq1 = remote_store.get_substring(seq_digest_chr1, 0, 10)
print(f" Sequence: {seq1}")
print(f"\nCache directory contents (after first access):")
for root, dirs, files in os.walk(cache_dir):
    for file in files:
        rel_path = os.path.relpath(os.path.join(root, file), cache_dir)
        file_size = os.path.getsize(os.path.join(root, file))
        print(f" - {rel_path} ({file_size} bytes)")
# Second access - uses cached data (no network request)
print("\nSecond access (uses cache, no network request):")
seq2 = remote_store.get_substring(seq_digest_chr1, 5, 15)
print(f" Sequence: {seq2}")
2.5 Real-World Example Template¶
Here's how you would use this with an actual remote RefgetStore:
# Example: Loading a remote genome (pseudocode - URL would need to be real)
'''
# Load human genome hg38 from a remote server
cache_path = "/data/refget_cache/hg38"
remote_url = "https://refget-server.example.com/hg38"
hg38_store = GlobalRefgetStore.load_remote(cache_path, remote_url)
# Get chr1 sequence (only chr1 data is downloaded and cached)
chr1_digest = "..."
sequence = hg38_store.get_substring(chr1_digest, 1000000, 1001000)
# Subsequent accesses use the cached chr1 data
another_region = hg38_store.get_substring(chr1_digest, 2000000, 2001000)
'''
print("See code cell for real-world usage example")
Part 3: Comparison - Local vs Remote Loading¶
When to Use Each Method¶
| Method | Use Case | Pros | Cons |
|---|---|---|---|
| load_local(path) | Store is already on disk | Fast, no network | Requires full download |
| load_remote(cache, url) | Store hosted remotely | Only download what you need | Slower first access |
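If a script needs to accept either kind of source, a tiny dispatch helper can capture the decision: anything with a URL scheme (including the file:// simulation used earlier) goes through the remote path, plain filesystem paths go through the local one. This is a hypothetical convenience wrapper, not part of gtars:

```python
def pick_loader(source: str) -> str:
    """Classify a store source: any URL scheme -> 'remote', plain path -> 'local'.
    (Hypothetical helper for dispatching between load_local and load_remote.)"""
    return "remote" if "://" in source else "local"

# With gtars, one might then dispatch (sketch):
# store = (GlobalRefgetStore.load_remote(cache_dir, source)
#          if pick_loader(source) == "remote"
#          else GlobalRefgetStore.load_local(source))

print(pick_loader("https://refget.example.com/hg38"))  # remote
print(pick_loader("/data/stores/hg38"))                # local
```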
Performance Characteristics¶
Local Store:
- Initial load: Fast (metadata only)
- First sequence access: Fast (local disk)
- Subsequent access: Fast (cached in memory)
Remote Store:
- Initial load: 2 HTTP requests (index.json, sequences.farg)
- First sequence access: 1 HTTP request + cache write
- Subsequent access: Fast (cached on disk)
Benefits of Lazy Loading¶
- Memory efficient: Only loaded sequences consume memory
- Bandwidth efficient: Only download what you use
- Fast startup: No waiting for full genome download
- Selective caching: Control which sequences to cache locally
Cleanup¶
Remove temporary files:
import shutil
# Clean up temporary directory
shutil.rmtree(temp_dir)
print(f"Cleaned up temporary directory: {temp_dir}")
Summary¶
This notebook demonstrated:
- ✅ Creating and populating a RefgetStore from FASTA files
- ✅ Retrieving sequences by digest and collection name
- ✅ Extracting substrings and BED file regions
- ✅ Extracting complete FASTA files from sequence collections
- ✅ Saving stores to disk and loading them back
- ✅ Loading remote RefgetStores with lazy loading
- ✅ On-demand sequence fetching with local caching
For more information, see the gtars documentation.