GlobalRefgetStore Tutorial¶
This notebook demonstrates how to use the GlobalRefgetStore class from the gtars.refget module for managing reference genome sequences using the GA4GH refget standard.
Features Covered¶
- Creating and populating a local RefgetStore
- Retrieving sequences by ID and collection name
- Getting substrings and BED file regions
- Extracting complete FASTA from a sequence collection
- Saving and loading stores
- Loading remote RefgetStores with lazy loading
- Working with cached remote sequences
Setup¶
First, import the necessary modules:
import os
import tempfile
from gtars.refget import GlobalRefgetStore, StorageMode, digest_fasta
Part 1: Local RefgetStore - Full Workflow¶
This section demonstrates the complete workflow for creating and using a local RefgetStore.
1.1 Create a Sample FASTA File¶
We'll create a temporary FASTA file with two chromosomes:
# Create a temporary directory for our example
temp_dir = tempfile.mkdtemp()
print(f"Working in temporary directory: {temp_dir}")
# Create a sample FASTA file
fasta_content = (
    ">chr1\n"
    "ATGCATGCATGCAGTCGTAGC\n"
    ">chr2\n"
    "GGGGAAAA\n"
)
source_fasta_path = os.path.join(temp_dir, "source.fa")
with open(source_fasta_path, "w") as f:
    f.write(fasta_content)
print(f"Created FASTA file: {source_fasta_path}")
1.2 Digest the FASTA to Get Collection Information¶
The digest_fasta function computes GA4GH-compliant digests for each sequence:
# Digest the FASTA to get collection info
collection = digest_fasta(source_fasta_path)
collection_digest = collection.digest
print(f"Collection digest: {collection_digest}")
print(f"Number of sequences: {len(collection)}")
print(f"\nSequences in collection:")
for seq in collection:
    print(f" - {seq.metadata.name}: {seq.metadata.sha512t24u}")
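For reference, the sha512t24u identifier printed above is defined by the GA4GH refget standard: the SHA-512 hash of the uppercased sequence, truncated to 24 bytes and base64url-encoded (24 bytes encode to exactly 32 characters, so no padding appears). A minimal standard-library sketch, independent of gtars:

```python
import base64
import hashlib

def sha512t24u(sequence: str) -> str:
    """GA4GH sha512t24u digest: SHA-512 of the uppercased sequence,
    truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(sequence.upper().encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")

# Digest of the tutorial's chr1 sequence
print(sha512t24u("ATGCATGCATGCAGTCGTAGC"))
```

Because the input is uppercased before hashing, the digest is case-insensitive with respect to the source FASTA.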
1.3 Initialize and Populate a GlobalRefgetStore¶
Create a store in Encoded mode for efficient storage:
# Initialize store in Encoded mode
store = GlobalRefgetStore(StorageMode.Encoded)
print(f"Initialized store: {store}")
# Import the FASTA file into the store
store.add_sequence_collection_from_fasta(source_fasta_path)
print("\nFASTA imported into the store.")
print(f"Store now contains: {store}")
1.4 Retrieve Sequences by ID¶
Get a complete sequence using its digest:
# Get the digest for chr1
seq_digest_chr1 = collection[0].metadata.sha512t24u
# Retrieve the sequence record
record_chr1 = store.get_sequence_by_id(seq_digest_chr1)
if record_chr1:
    print(f"Retrieved sequence: {record_chr1.metadata.name}")
    print(f" Length: {record_chr1.metadata.length}")
    print(f" Alphabet: {record_chr1.metadata.alphabet}")

    # Get the full sequence
    full_seq = store.get_substring(seq_digest_chr1, 0, record_chr1.metadata.length)
    print(f" Sequence: {full_seq}")
1.5 Get Substrings¶
Extract specific regions from sequences:
# Get a substring from chr1 (positions 5-15)
sub_seq = store.get_substring(seq_digest_chr1, 5, 15)
print(f"Substring chr1[5:15]: {sub_seq}")
# Get a substring from chr2
seq_digest_chr2 = collection[1].metadata.sha512t24u
sub_seq_chr2 = store.get_substring(seq_digest_chr2, 0, 4)
print(f"Substring chr2[0:4]: {sub_seq_chr2}")
1.6 Retrieve Regions from BED Files¶
Extract multiple regions specified in a BED file:
# Create a BED file with regions of interest
bed_content = (
    "chr1\t0\t10\n"
    "chr2\t2\t6\n"
    "chr_nonexistent\t0\t5\n"  # This will be skipped
)
bed_path = os.path.join(temp_dir, "regions.bed")
with open(bed_path, "w") as f:
    f.write(bed_content)
# Retrieve sequences as a list
retrieved_list = store.substrings_from_regions(collection_digest, bed_path)
print("Retrieved sequences from BED file:")
for rs in retrieved_list:
    print(f" {rs.chrom_name}[{rs.start}-{rs.end}]: {rs.sequence}")
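BED coordinates are 0-based and half-open, so each line maps directly onto a Python-style slice of the named sequence, and lines naming an unknown chromosome (like chr_nonexistent above) are skipped. A rough pure-Python illustration of that behavior, using a hypothetical helper rather than the gtars API:

```python
def regions_from_bed(bed_text, sequences):
    """Slice sequences by BED lines; skip unknown chromosomes.

    `sequences` maps chromosome name -> full sequence string.
    (Illustrative helper, not the gtars implementation.)
    """
    results = []
    for line in bed_text.splitlines():
        if not line.strip():
            continue
        chrom, start, end = line.split("\t")[:3]
        if chrom not in sequences:
            continue  # mirror the store's skipping of unknown names
        results.append((chrom, sequences[chrom][int(start):int(end)]))
    return results

seqs = {"chr1": "ATGCATGCATGCAGTCGTAGC", "chr2": "GGGGAAAA"}
bed = "chr1\t0\t10\nchr2\t2\t6\nchr_nonexistent\t0\t5\n"
for chrom, sub in regions_from_bed(bed, seqs):
    print(chrom, sub)  # chr1 ATGCATGCAT / chr2 GGAA
```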
1.7 Write Retrieved Sequences to FASTA¶
Save BED file regions as a new FASTA file:
# Write retrieved sequences to a FASTA file
output_fasta_path = os.path.join(temp_dir, "output_regions.fa")
store.export_fasta_from_regions(collection_digest, bed_path, output_fasta_path)
print(f"Sequences written to: {output_fasta_path}\n")
with open(output_fasta_path, "r") as f:
    print("Content:")
    print(f.read())
1.8 Extract Complete FASTA from a Sequence Collection¶
You can reconstruct the entire FASTA file from a sequence collection by creating a BED file that covers all chromosomes:
# Create a BED file covering all sequences in the collection
# Each line is: chrom_name 0 length (whole chromosome)
all_seqs_bed = os.path.join(temp_dir, "all_sequences.bed")
with open(all_seqs_bed, "w") as f:
    for seq in collection:
        f.write(f"{seq.metadata.name}\t0\t{seq.metadata.length}\n")
print("BED file covering all sequences:")
with open(all_seqs_bed, "r") as f:
    print(f.read())
# Extract complete FASTA from the collection
extracted_fasta_path = os.path.join(temp_dir, "extracted_complete.fa")
store.export_fasta_from_regions(collection_digest, all_seqs_bed, extracted_fasta_path)
print(f"\nExtracted complete FASTA to: {extracted_fasta_path}")
with open(extracted_fasta_path, "r") as f:
    print("\nContent:")
    print(f.read())
Alternative: You can also extract sequences programmatically without a BED file:
# Extract all sequences from a collection programmatically
manual_fasta_path = os.path.join(temp_dir, "manual_extraction.fa")
with open(manual_fasta_path, "w") as f:
    for seq in collection:
        # Get the sequence data
        seq_data = store.get_substring(
            seq.metadata.sha512t24u,
            0,
            seq.metadata.length
        )
        # Write FASTA header and sequence
        f.write(f">{seq.metadata.name}\n")
        f.write(f"{seq_data}\n")
print(f"Manually extracted FASTA to: {manual_fasta_path}")
with open(manual_fasta_path, "r") as f:
    print("\nContent:")
    print(f.read())
1.9 Save Store to Disk¶
Persist the store for later use:
# Save the store to a directory
saved_store_path = os.path.join(temp_dir, "my_refget_store")
store.write_store_to_dir(saved_store_path, "sequences/%s2/%s.seq")
print(f"Store saved to: {saved_store_path}")
print(f"\nStore structure:")
for root, dirs, files in os.walk(saved_store_path):
    level = root.replace(saved_store_path, '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = ' ' * 2 * (level + 1)
    for file in files:
        print(f"{subindent}{file}")
1.10 Load Store from Disk¶
Load a previously saved store (with lazy loading):
# Load the store using load_local method (with lazy loading)
loaded_store = GlobalRefgetStore.load_local(saved_store_path)
print(f"Store loaded from: {saved_store_path}")
print(f"Loaded store: {loaded_store}")
# Verify we can retrieve sequences (data loaded on-demand)
seq = loaded_store.get_substring(seq_digest_chr1, 0, 10)
print(f"\nRetrieved sequence from loaded store: {seq}")
Part 2: Remote RefgetStore with Lazy Loading¶
The load_remote() method allows you to work with RefgetStores hosted on remote servers without downloading everything upfront. Sequence data is fetched on-demand and cached locally.
2.1 Understanding Remote Stores¶
Key Concepts:
- Metadata loaded immediately: Index files (index.json, sequences.farg) are fetched from the remote
- Sequence data loaded on-demand: Actual .seq files are only downloaded when accessed
- Local caching: Downloaded sequences are cached to avoid re-fetching
- User-controlled cache: You specify where cached data is stored
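The fetch-on-demand pattern behind this design can be sketched generically: check the local cache first, download only on a miss, and serve every later request from disk. A simplified pure-Python illustration of the concept (not the gtars implementation):

```python
import os
import tempfile

class LazyFetcher:
    """Fetch-on-demand with a local file cache (illustrative sketch)."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch        # callable: key -> bytes (stands in for the network)
        self.fetch_count = 0      # how many "remote" fetches actually happened

    def get(self, key):
        path = os.path.join(self.cache_dir, key)
        if not os.path.exists(path):          # cache miss: download once
            self.fetch_count += 1
            with open(path, "wb") as f:
                f.write(self.fetch(key))
        with open(path, "rb") as f:           # cache hit from here on
            return f.read()

cache = tempfile.mkdtemp()
lazy = LazyFetcher(cache, fetch=lambda key: b"ATGC" * 4)
lazy.get("chr1.seq")      # first access: fetches and caches
lazy.get("chr1.seq")      # second access: served from the cache
print(lazy.fetch_count)   # 1
```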
2.2 Simulating a Remote Store¶
For this example, we'll use the store we created earlier and treat it as a "remote" store using a file:// URL. In practice, you'd use http:// or https:// URLs pointing to an actual remote server.
# Create a cache directory for remote data
cache_dir = os.path.join(temp_dir, "refget_cache")
os.makedirs(cache_dir, exist_ok=True)
print(f"Cache directory: {cache_dir}")
# Simulate a remote URL (in practice, this would be https://...)
remote_url = f"file://{saved_store_path}"
print(f"Remote URL: {remote_url}")
2.3 Load Remote Store¶
Load a store from a remote URL with local caching:
# Load remote store
# In a real scenario, remote_url would be something like:
# "https://s3.amazonaws.com/genomes/hg38" or "https://refget.example.com/store"
remote_store = GlobalRefgetStore.load_remote(cache_dir, remote_url)
print(f"Remote store loaded: {remote_store}")
print(f"\nCache directory contents (metadata only at this point):")
for root, dirs, files in os.walk(cache_dir):
    for file in files:
        rel_path = os.path.relpath(os.path.join(root, file), cache_dir)
        print(f" - {rel_path}")
2.4 Access Sequences (Triggers Lazy Loading)¶
When you request a sequence, it's automatically fetched and cached:
# First access - triggers download and caching
print("First access (will fetch from remote and cache):")
seq1 = remote_store.get_substring(seq_digest_chr1, 0, 10)
print(f" Sequence: {seq1}")
print(f"\nCache directory contents (after first access):")
for root, dirs, files in os.walk(cache_dir):
    for file in files:
        rel_path = os.path.relpath(os.path.join(root, file), cache_dir)
        file_size = os.path.getsize(os.path.join(root, file))
        print(f" - {rel_path} ({file_size} bytes)")
# Second access - uses cached data (no network request)
print("\nSecond access (uses cache, no network request):")
seq2 = remote_store.get_substring(seq_digest_chr1, 5, 15)
print(f" Sequence: {seq2}")
2.5 Real-World Example Template¶
Here's how you would use this with an actual remote RefgetStore:
# Example: Loading a remote genome (pseudocode - URL would need to be real)
'''
# Load human genome hg38 from a remote server
cache_path = "/data/refget_cache/hg38"
remote_url = "https://refget-server.example.com/hg38"
hg38_store = GlobalRefgetStore.load_remote(cache_path, remote_url)
# Get chr1 sequence (only chr1 data is downloaded and cached)
chr1_digest = "..."
sequence = hg38_store.get_substring(chr1_digest, 1000000, 1001000)
# Subsequent accesses use the cached chr1 data
another_region = hg38_store.get_substring(chr1_digest, 2000000, 2001000)
'''
print("See code cell for real-world usage example")
Part 3: Comparison - Local vs Remote Loading¶
When to Use Each Method¶
| Method | Use Case | Pros | Cons |
|---|---|---|---|
| load_local(path) | Store is already on disk | Fast, no network | Requires full download |
| load_remote(cache, url) | Store hosted remotely | Only download what you need | Slower first access |
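If a script needs to accept either kind of source, a tiny dispatch helper can capture the decision: anything with a URL scheme (including the file:// simulation used earlier) goes through the remote path, plain filesystem paths go through the local one. This is a hypothetical convenience wrapper, not part of gtars:

```python
def pick_loader(source: str) -> str:
    """Classify a store source: any URL scheme -> 'remote', plain path -> 'local'.
    (Hypothetical helper for dispatching between load_local and load_remote.)"""
    return "remote" if "://" in source else "local"

# With gtars, one might then dispatch (sketch):
# store = (GlobalRefgetStore.load_remote(cache_dir, source)
#          if pick_loader(source) == "remote"
#          else GlobalRefgetStore.load_local(source))

print(pick_loader("https://refget.example.com/hg38"))  # remote
print(pick_loader("/data/stores/hg38"))                # local
```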
Performance Characteristics¶
Local Store:
- Initial load: Fast (metadata only)
- First sequence access: Fast (local disk)
- Subsequent access: Fast (cached in memory)
Remote Store:
- Initial load: 2 HTTP requests (index.json, sequences.farg)
- First sequence access: 1 HTTP request + cache write
- Subsequent access: Fast (cached on disk)
Benefits of Lazy Loading¶
- Memory efficient: Only loaded sequences consume memory
- Bandwidth efficient: Only download what you use
- Fast startup: No waiting for full genome download
- Selective caching: Control which sequences to cache locally
Cleanup¶
Remove temporary files:
import shutil
# Clean up temporary directory
shutil.rmtree(temp_dir)
print(f"Cleaned up temporary directory: {temp_dir}")
Summary¶
This notebook demonstrated:
- ✅ Creating and populating a RefgetStore from FASTA files
- ✅ Retrieving sequences by digest and collection name
- ✅ Extracting substrings and BED file regions
- ✅ Extracting complete FASTA files from sequence collections
- ✅ Saving stores to disk and loading them back
- ✅ Loading remote RefgetStores with lazy loading
- ✅ On-demand sequence fetching with local caching
For more information, see the gtars documentation.