Refget Module API Reference
refget
Type stubs and documentation for the gtars.refget module.
This file serves two purposes:
- Type Hints: Provides type annotations for IDE autocomplete and static type checking tools like mypy.
- Documentation: Contains Google-style docstrings that mkdocstrings uses to generate the API reference documentation website.
Note: The actual implementation is in Rust (gtars-python/src/refget/mod.rs) and compiled via PyO3. This stub file provides the Python interface definition and structured documentation that tools can parse properly.
Classes
AlphabetType
Bases: Enum
Represents the type of alphabet for a sequence.
Attributes
Dna2bit
instance-attribute
Dna2bit: int
Dna3bit
instance-attribute
Dna3bit: int
DnaIupac
instance-attribute
DnaIupac: int
Protein
instance-attribute
Protein: int
Ascii
instance-attribute
Ascii: int
Unknown
instance-attribute
Unknown: int
Functions
__str__
__str__() -> str
SequenceMetadata
Metadata for a biological sequence.
Attributes
name
instance-attribute
name: str
length
instance-attribute
length: int
sha512t24u
instance-attribute
sha512t24u: str
md5
instance-attribute
md5: str
alphabet
instance-attribute
alphabet: AlphabetType
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
SequenceRecord
A record representing a biological sequence, including its metadata and optional data.
Attributes
metadata
instance-attribute
metadata: SequenceMetadata
sequence
instance-attribute
sequence: Optional[bytes]
Functions
decode
decode() -> Optional[str]
Decode and return the sequence data as a string.
For Full records with sequence data, returns the decoded sequence. For Stub records without sequence data, returns None.
Returns:
- Optional[str]: Decoded sequence string if data is available, None otherwise.
__repr__
__repr__() -> str
__str__
__str__() -> str
SeqColDigestLvl1
Level 1 digests for a sequence collection.
Attributes
sequences_digest
instance-attribute
sequences_digest: str
names_digest
instance-attribute
names_digest: str
lengths_digest
instance-attribute
lengths_digest: str
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
SequenceCollectionMetadata
Metadata for a sequence collection.
Contains the collection digest and level 1 digests for names, sequences, and lengths. This is a lightweight representation of a collection without the actual sequence list.
Attributes:
- digest (str): The collection's SHA-512/24u digest.
- n_sequences (int): Number of sequences in the collection.
- names_digest (str): Level 1 digest of the names array.
- sequences_digest (str): Level 1 digest of the sequences array.
- lengths_digest (str): Level 1 digest of the lengths array.
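The relationship between the level 1 digests and the collection digest can be sketched in plain Python. This is an illustration only, assuming the store follows the GA4GH sequence collections scheme: each array is serialized as canonical JSON and digested with sha512t24u, and the collection digest is the digest of the object holding the level 1 digests. The helper names, the "SQ." prefix on sequence digests, and the example data are assumptions, not taken from the gtars implementation.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Truncated SHA-512 (first 24 bytes), base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    """Compact, key-sorted JSON; approximates RFC 8785 for plain strings and ints."""
    text = json.dumps(value, separators=(",", ":"), sort_keys=True, ensure_ascii=False)
    return text.encode("utf-8")


# Hypothetical two-sequence collection.
names = ["chr1", "chr2"]
raw_sequences = [b"ACGTACGT", b"GGCC"]
lengths = [len(s) for s in raw_sequences]
# Assumption: sequence entries are refget digests carrying an "SQ." prefix.
sequences = ["SQ." + sha512t24u(s) for s in raw_sequences]

# Level 1: one digest per array.
level1 = {
    "names": sha512t24u(canonical_json(names)),
    "lengths": sha512t24u(canonical_json(lengths)),
    "sequences": sha512t24u(canonical_json(sequences)),
}

# Assumption: the collection digest is the digest of the level 1 object.
collection_digest = sha512t24u(canonical_json(level1))

print(level1)
print(collection_digest)
```

Because each array is digested separately, two collections that share names but differ in sequence content would share a names digest while their collection digests differ.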
Attributes
digest
instance-attribute
digest: str
n_sequences
instance-attribute
n_sequences: int
names_digest
instance-attribute
names_digest: str
sequences_digest
instance-attribute
sequences_digest: str
lengths_digest
instance-attribute
lengths_digest: str
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
SequenceCollection
A collection of biological sequences.
Attributes
sequences
instance-attribute
sequences: List[SequenceRecord]
digest
instance-attribute
digest: str
lvl1
instance-attribute
lvl1: SeqColDigestLvl1
file_path
instance-attribute
file_path: Optional[str]
has_data
instance-attribute
has_data: bool
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
SequenceCollectionRecord
A record representing a sequence collection, which may be a Stub or Full.
Stub records contain only metadata (digest, n_sequences, level 1 digests). Full records contain metadata plus the list of SequenceRecord objects.
Attributes
metadata
instance-attribute
metadata: SequenceCollectionMetadata
sequences
property
sequences: Optional[List[SequenceRecord]]
Get the sequences if loaded (Full), None if stub-only.
Functions
has_sequences
has_sequences() -> bool
Check if this record has sequences loaded (is Full, not Stub).
__repr__
__repr__() -> str
__str__
__str__() -> str
RetrievedSequence
Represents a retrieved sequence segment with its metadata.
Exposed from the Rust PyRetrievedSequence struct.
Attributes
sequence
instance-attribute
sequence: str
chrom_name
instance-attribute
chrom_name: str
start
instance-attribute
start: int
end
instance-attribute
end: int
Functions
__init__
__init__(sequence: str, chrom_name: str, start: int, end: int) -> None
__repr__
__repr__() -> str
__str__
__str__() -> str
StorageMode
Bases: Enum
Defines how sequence data is stored in the Refget store.
Attributes
Raw
instance-attribute
Raw: int
Encoded
instance-attribute
Encoded: int
RefgetStore
A global store for GA4GH refget sequences with lazy-loading support.
RefgetStore provides content-addressable storage for reference genome sequences following the GA4GH refget specification. Supports both local and remote stores with on-demand sequence loading.
Attributes:
- cache_path (Optional[str]): Local directory path where the store is located or cached. None for in-memory stores.
- remote_url (Optional[str]): Remote URL of the store if loaded remotely, None otherwise.
Note
Boolean evaluation: RefgetStore follows Python container semantics,
meaning bool(store) is False for empty stores (like list,
dict, etc.). To check if a store variable is initialized (not None),
use if store is not None: rather than if store:.
Example::
store = RefgetStore.in_memory() # Empty store
bool(store) # False (empty container)
len(store) # 0
# Wrong: checks emptiness, not initialization
if store:
    process(store)
# Right: checks if variable is set
if store is not None:
    process(store)
Examples:
Create a new store and import sequences::
from gtars.refget import RefgetStore, StorageMode
store = RefgetStore(StorageMode.Encoded)
store.import_fasta("genome.fa")
Open an existing local store::
store = RefgetStore.open_local("/data/hg38")
seq = store.get_substring("chr1_digest", 0, 1000)
Open a remote store with caching::
store = RefgetStore.open_remote(
"/local/cache",
"https://example.com/hg38"
)
Attributes
cache_path
instance-attribute
cache_path: Optional[str]
remote_url
instance-attribute
remote_url: Optional[str]
Functions
__init__
__init__(mode: StorageMode) -> None
Create a new empty RefgetStore.
Parameters:
- mode (StorageMode): Storage mode, either StorageMode.Raw (uncompressed) or StorageMode.Encoded (bit-packed, space-efficient).
Example::
store = RefgetStore(StorageMode.Encoded)
in_memory
classmethod
in_memory() -> RefgetStore
Create a new in-memory RefgetStore.
Creates a store that keeps all sequences in memory. Use this for temporary processing or when you don't need disk persistence.
Returns:
- RefgetStore: New empty RefgetStore with Encoded storage mode.
Example::
store = RefgetStore.in_memory()
store.import_fasta("genome.fa")
on_disk
classmethod
on_disk(cache_path: Union[str, PathLike]) -> RefgetStore
Create or load a disk-backed RefgetStore.
If the directory contains an existing store (rgstore.json), loads it. Otherwise creates a new store with Encoded mode.
Parameters:
- cache_path (Union[str, PathLike]): Directory path for the store. Created if it doesn't exist.
Returns:
- RefgetStore: RefgetStore (new or loaded from disk).
Example::
store = RefgetStore.on_disk("/data/my_store")
store.import_fasta("genome.fa")
# Store is automatically persisted to disk
open_local
classmethod
open_local(path: Union[str, PathLike]) -> RefgetStore
Open a local RefgetStore from a directory.
Loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly accessed with get_collection()/get_sequence().
Expects: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi
Parameters:
- path (Union[str, PathLike]): Local directory containing the refget store.
Returns:
- RefgetStore: RefgetStore with metadata loaded, sequences lazy-loaded.
Raises:
- IOError: If the store directory or index files cannot be read.
Example::
store = RefgetStore.open_local("/data/hg38_store")
seq = store.get_substring("chr1_digest", 0, 1000)
open_remote
classmethod
open_remote(cache_path: Union[str, PathLike], remote_url: str) -> RefgetStore
Open a remote RefgetStore with local caching.
Loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when get_collection()/get_sequence() is called.
By default, persistence is enabled (sequences are cached to disk).
Call disable_persistence() after loading to keep only in memory.
Parameters:
- cache_path (Union[str, PathLike]): Local directory to cache downloaded metadata and sequences. Created if it doesn't exist.
- remote_url (str): Base URL of the remote refget store (e.g., "https://example.com/hg38" or "s3://bucket/hg38").
Returns:
- RefgetStore: RefgetStore with metadata loaded, sequences fetched on-demand.
Raises:
- IOError: If remote metadata cannot be fetched or cache cannot be written.
Example::
store = RefgetStore.open_remote(
"/data/cache/hg38",
"https://refget-server.com/hg38"
)
# First access fetches from remote and caches
seq = store.get_substring("chr1_digest", 0, 1000)
# Second access uses cache
seq2 = store.get_substring("chr1_digest", 1000, 2000)
set_encoding_mode
set_encoding_mode(mode: StorageMode) -> None
Change the storage mode, re-encoding/decoding existing sequences as needed.
When switching from Raw to Encoded, all Full sequences in memory are encoded (2-bit packed). When switching from Encoded to Raw, all Full sequences in memory are decoded back to raw bytes.
Parameters:
- mode (StorageMode): The storage mode to switch to (StorageMode.Raw or StorageMode.Encoded).
Example::
store = RefgetStore.in_memory()
store.set_encoding_mode(StorageMode.Raw)
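This reference does not spell out the bit layout behind StorageMode.Encoded, and the AlphabetType values (Dna2bit, Dna3bit, DnaIupac, ...) suggest it depends on the alphabet. As a rough illustration of what "2-bit packed" means for an A/C/G/T sequence, the sketch below packs four bases into each byte. The code table, the bit order, and the helper names pack_2bit and unpack_2bit are assumptions made for this example, not the gtars encoding.

```python
# Assumed code table: two bits per base.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)
        out[i // 4] |= CODES[base] << shift
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the original string; the length is needed because the last byte may be padded."""
    chars = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        chars.append(BASES[(packed[i // 4] >> shift) & 0b11])
    return "".join(chars)


packed = pack_2bit("ACGTAC")
restored = unpack_2bit(packed, 6)
print(len(packed), restored)
```

Six bases fit in two bytes here, which is the space saving that motivates an encoded mode; decoding needs the sequence length because the final byte is zero-padded.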
enable_persistence
enable_persistence(path: Union[str, PathLike]) -> None
Enable disk persistence for this store.
Sets up the store to write sequences to disk. Any in-memory Full sequences are flushed to disk and converted to Stubs.
Parameters:
- path (Union[str, PathLike]): Directory for storing sequences and metadata.
Raises:
- IOError: If the directory cannot be created or written to.
Example::
store = RefgetStore.in_memory()
store.import_fasta("genome.fa")
store.enable_persistence("/data/store") # Flush to disk
disable_persistence
disable_persistence() -> None
Disable disk persistence for this store.
New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if cache_path is set.
Example::
store = RefgetStore.open_remote("/cache", "https://example.com")
store.disable_persistence() # Stop caching new sequences
import_fasta
import_fasta(file_path: Union[str, PathLike]) -> None
Import sequences from a FASTA file into the store.
Reads all sequences from a FASTA file and adds them to the store. Computes GA4GH digests and creates a sequence collection.
Parameters:
- file_path (Union[str, PathLike]): Path to the FASTA file.
Raises:
- IOError: If the file cannot be read or parsed.
Example::
store = RefgetStore(StorageMode.Encoded)
store.import_fasta("genome.fa")
list_collections
list_collections() -> List[SequenceCollectionMetadata]
List all collection metadata in the store.
Returns metadata for all collections without loading full collection data. Use this for browsing/inventory operations.
Returns:
- List[SequenceCollectionMetadata]: List of metadata for all collections.
Example::
for meta in store.list_collections():
    print(f"Collection {meta.digest}: {meta.n_sequences} sequences")
get_collection_metadata
get_collection_metadata(collection_digest: str) -> Optional[SequenceCollectionMetadata]
Get metadata for a collection by digest.
Returns lightweight metadata without loading the full collection. Use this for quick lookups of collection information.
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
Returns:
- Optional[SequenceCollectionMetadata]: Collection metadata if found, None otherwise.
Example::
meta = store.get_collection_metadata("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
if meta:
    print(f"Collection has {meta.n_sequences} sequences")
get_collection
get_collection(collection_digest: str) -> SequenceCollection
Get a collection by digest with all sequences loaded.
Loads the collection and all its sequence data into memory. Use this when you need full access to sequence content.
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
Returns:
- SequenceCollection: The collection with all sequence data loaded.
Raises:
- IOError: If the collection cannot be loaded.
Example::
collection = store.get_collection("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
for seq in collection.sequences:
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
iter_collections
iter_collections() -> List[SequenceCollection]
Iterate over all collections with their sequences loaded.
This loads all collection data upfront and returns a list of SequenceCollection objects with full sequence data.
For browsing without loading data, use list_collections() instead.
Returns:
- List[SequenceCollection]: List of all collections with loaded sequences.
Example::
for coll in store.iter_collections():
    print(f"{coll.digest}: {len(coll.sequences)} sequences")
is_collection_loaded
is_collection_loaded(collection_digest: str) -> bool
Check if a collection is fully loaded.
Returns True if the collection's sequence list is loaded in memory, False if it's only metadata (stub).
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
Returns:
- bool: True if loaded, False otherwise.
list_sequences
list_sequences() -> List[SequenceMetadata]
List all sequence metadata in the store.
Returns metadata for all sequences without loading sequence data. Use this for browsing/inventory operations.
Returns:
- List[SequenceMetadata]: List of metadata for all sequences in the store.
Example::
for meta in store.list_sequences():
    print(f"{meta.name}: {meta.length} bp")
get_sequence_metadata
get_sequence_metadata(seq_digest: str) -> Optional[SequenceMetadata]
Get metadata for a sequence by digest (no data loaded).
Use this for lightweight lookups when you don't need the actual sequence.
Parameters:
- seq_digest (str): The sequence's SHA-512/24u digest.
Returns:
- Optional[SequenceMetadata]: Sequence metadata if found, None otherwise.
get_sequence
get_sequence(digest: str) -> Optional[SequenceRecord]
Retrieve a sequence record by its digest (SHA-512/24u or MD5).
Loads the sequence data if not already in memory. Supports lookup by either SHA-512/24u (preferred) or MD5 digest.
Parameters:
- digest (str): Sequence digest (SHA-512/24u base64url or MD5 hex string).
Returns:
- Optional[SequenceRecord]: The sequence record with data if found, None otherwise.
Example::
record = store.get_sequence("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
if record:
    print(f"Found: {record.metadata.name}")
    print(f"Sequence: {record.decode()[:50]}...")
get_sequence_by_name
get_sequence_by_name(collection_digest: str, sequence_name: str) -> Optional[SequenceRecord]
Retrieve a sequence by collection digest and sequence name.
Looks up a sequence within a specific collection using its name (e.g., "chr1", "chrM"). Loads the sequence data if needed.
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
- sequence_name (str): Name of the sequence within that collection.
Returns:
- Optional[SequenceRecord]: The sequence record with data if found, None otherwise.
Example::
record = store.get_sequence_by_name(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "chr1"
)
if record:
    print(f"Sequence: {record.decode()[:50]}...")
iter_sequences
iter_sequences() -> List[SequenceRecord]
Iterate over all sequences with their data loaded.
This ensures all sequence data is loaded and returns a list of SequenceRecord objects with full sequence data.
For browsing without loading data, use list_sequences() instead.
Returns:
- List[SequenceRecord]: List of all sequences with loaded data.
Example::
for seq in store.iter_sequences():
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
get_substring
get_substring(seq_digest: str, start: int, end: int) -> Optional[str]
Extract a substring from a sequence.
Retrieves a specific region from a sequence using 0-based, half-open coordinates [start, end). Automatically loads sequence data if not already cached (for lazy-loaded stores).
Parameters:
- seq_digest (str): Sequence digest (SHA-512/24u).
- start (int): Start position (0-based, inclusive).
- end (int): End position (0-based, exclusive).
Returns:
- Optional[str]: The substring sequence if found, None otherwise.
Example::
# Get first 1000 bases of chr1
seq = store.get_substring("chr1_digest", 0, 1000)
print(f"First 50bp: {seq[:50]}")
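The 0-based, half-open convention described above behaves like Python slicing. The sketch below uses a plain string in place of a stored sequence; the substring helper and its bounds check are hypothetical and only illustrate the coordinate arithmetic, not how the store handles out-of-range requests.

```python
def substring(sequence: str, start: int, end: int) -> str:
    """Return the bases in the half-open interval [start, end)."""
    # Assumption for this sketch: reject coordinates outside the sequence.
    if not 0 <= start <= end <= len(sequence):
        raise ValueError(f"invalid interval [{start}, {end}) for length {len(sequence)}")
    return sequence[start:end]


seq = "ACGTTTGGCA"
first = substring(seq, 0, 4)    # positions 0, 1, 2, 3
second = substring(seq, 4, 8)   # positions 4, 5, 6, 7

print(first, second)
print(len(first) == 4 - 0)      # the length is always end - start
```

With half-open intervals, adjacent windows such as [0, 4) and [4, 8) tile the sequence without overlapping or skipping a base.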
stats
stats() -> dict
Returns statistics about the store.
Returns:
- dict: Dictionary with the following keys:
  - 'n_sequences': Total number of sequences (Stub + Full)
  - 'n_sequences_loaded': Number of sequences with data loaded (Full)
  - 'n_collections': Total number of collections (Stub + Full)
  - 'n_collections_loaded': Number of collections with sequences loaded (Full)
  - 'storage_mode': Storage mode ('Raw' or 'Encoded')
  - 'total_disk_size': Total size of all files on disk in bytes
Note
n_collections_loaded only reflects collections fully loaded in memory. For remote stores, collections are loaded on-demand when accessed.
Example::
stats = store.stats()
print(f"Store has {stats['n_sequences']} sequences")
print(f"Collections: {stats['n_collections']} total, {stats['n_collections_loaded']} loaded")
write_store_to_directory
write_store_to_directory(root_path: Union[str, PathLike], seqdata_path_template: str) -> None
Write the store to a directory on disk.
Persists the store with all sequences and metadata to disk using the RefgetStore directory format.
Parameters:
- root_path (Union[str, PathLike]): Directory path to write the store to.
- seqdata_path_template (str): Path template for sequence files (e.g., "sequences/%s2/%s.seq" where %s2 = first 2 chars of digest, %s = full digest).
Example::
store.write_store_to_directory(
"/data/my_store",
"sequences/%s2/%s.seq"
)
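To make the template placeholders concrete, here is a minimal sketch of how a template such as "sequences/%s2/%s.seq" could expand for one digest, based only on the placeholder meanings given above. The expand_seqdata_template helper is hypothetical and is not part of the gtars API.

```python
def expand_seqdata_template(template: str, digest: str) -> str:
    """Expand %s2 (first two digest characters) and %s (full digest) in a path template."""
    # Replace the longer placeholder first so "%s2" is not read as "%s" followed by "2".
    return template.replace("%s2", digest[:2]).replace("%s", digest)


digest = "aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"
relative_path = expand_seqdata_template("sequences/%s2/%s.seq", digest)
print(relative_path)
```

Sharding by the first two digest characters spreads sequence files across subdirectories instead of placing every file in a single directory.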
get_seqs_bed_file
get_seqs_bed_file(collection_digest: str, bed_file_path: Union[str, PathLike], output_fasta_path: Union[str, PathLike]) -> None
Extract sequences for BED regions and write to FASTA.
Parameters:
- collection_digest (str): Collection digest to look up sequence names.
- bed_file_path (Union[str, PathLike]): Path to BED file with regions.
- output_fasta_path (Union[str, PathLike]): Path to write output FASTA file.
get_seqs_bed_file_to_vec
get_seqs_bed_file_to_vec(collection_digest: str, bed_file_path: Union[str, PathLike]) -> List[RetrievedSequence]
Extract sequences for BED regions and return as list.
Parameters:
- collection_digest (str): Collection digest to look up sequence names.
- bed_file_path (Union[str, PathLike]): Path to BED file with regions.
Returns:
- List[RetrievedSequence]: List of retrieved sequence segments.
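Conceptually, BED extraction maps each region's chromosome name to a sequence and slices it with the BED file's 0-based, half-open coordinates. The sketch below shows that idea on in-memory data. The Region dataclass merely mirrors the fields of RetrievedSequence, and the function name, the dictionary lookup, and the choice to skip unknown chromosomes are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    """Hypothetical stand-in with the same fields as RetrievedSequence."""
    sequence: str
    chrom_name: str
    start: int
    end: int


def extract_bed_regions(bed_text: str, sequences: Dict[str, str]) -> List[Region]:
    """Slice each BED region (chrom, start, end) out of the named sequence."""
    regions = []
    for line in bed_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and lines without three columns.
        if len(fields) < 3 or line.startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom not in sequences:
            continue  # assumption: regions on unknown sequences are skipped
        regions.append(Region(sequences[chrom][start:end], chrom, start, end))
    return regions


sequences = {"chr1": "ACGTACGTAC", "chr2": "GGGGCCCC"}
bed_text = "chr1\t0\t4\nchr2\t2\t6\nchrX\t0\t3\n"
regions = extract_bed_regions(bed_text, sequences)
for region in regions:
    print(region.chrom_name, region.start, region.end, region.sequence)
```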
export_fasta
export_fasta(collection_digest: str, output_path: Union[str, PathLike], sequence_names: Optional[List[str]] = None, line_width: Optional[int] = None) -> None
Export sequences from a collection to a FASTA file.
Parameters:
- collection_digest (str): Collection to export from.
- output_path (Union[str, PathLike]): Path to write FASTA file.
- sequence_names (Optional[List[str]], default: None): Optional list of sequence names to export. If None, exports all sequences in the collection.
- line_width (Optional[int], default: None): Optional line width for wrapping sequences. If None, uses default of 80.
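The effect of line_width can be illustrated with a few lines of plain Python: the sequence is cut into fixed-width lines under a ">name" header. The format_fasta_record helper is hypothetical; only the default width of 80 comes from the parameter description above.

```python
def format_fasta_record(name: str, sequence: str, line_width: int = 80) -> str:
    """Render one FASTA record, wrapping the sequence at line_width characters."""
    lines = [f">{name}"]
    for i in range(0, len(sequence), line_width):
        lines.append(sequence[i:i + line_width])
    return "\n".join(lines) + "\n"


record = format_fasta_record("chr1", "A" * 170)
widths = [len(line) for line in record.splitlines()[1:]]
print(widths)
```

Every sequence line except possibly the last has the same width, which is also what a FASTA index (FAI) relies on to locate bases by offset.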
export_fasta_by_digests
export_fasta_by_digests(digests: List[str], output_path: Union[str, PathLike], line_width: Optional[int] = None) -> None
Export sequences by their digests to a FASTA file.
Parameters:
- digests (List[str]): List of sequence digests to export.
- output_path (Union[str, PathLike]): Path to write FASTA file.
- line_width (Optional[int], default: None): Optional line width for wrapping sequences. If None, uses default of 80.
__str__
__str__() -> str
__repr__
__repr__() -> str
Functions
sha512t24u_digest
sha512t24u_digest(readable: Union[str, bytes]) -> str
Compute the GA4GH SHA-512/24u digest for a sequence.
This function computes the GA4GH refget standard digest (truncated SHA-512, base64url encoded) for a given sequence string or bytes.
Parameters:
- readable (Union[str, bytes]): Input sequence as str or bytes.
Returns:
- str: The SHA-512/24u digest (32 character base64url string).
Raises:
- TypeError: If input is not str or bytes.
Example::
from gtars.refget import sha512t24u_digest
digest = sha512t24u_digest("ACGT")
print(digest)
# Output: 'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
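For readers who want to see what "truncated SHA-512, base64url encoded" amounts to, the digest can be reproduced with the Python standard library alone. This is a sketch of the GA4GH scheme, not the Rust implementation behind sha512t24u_digest; the function name sha512t24u is chosen for this example.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to the first 24 bytes, then base64url-encoded."""
    full = hashlib.sha512(data).digest()   # 64 bytes
    truncated = full[:24]                   # keep the first 24 bytes
    # 24 bytes encode to exactly 32 base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


digest = sha512t24u(b"ACGT")
print(digest, len(digest))
```

The 24-byte truncation is what gives the fixed 32-character length, and the URL-safe alphabet is why digests may contain "-" and "_" but never "+" or "/".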
md5_digest
md5_digest(readable: Union[str, bytes]) -> str
Compute the MD5 digest for a sequence.
This function computes the MD5 hash for a given sequence string or bytes. MD5 is supported for backward compatibility with legacy systems.
Parameters:
- readable (Union[str, bytes]): Input sequence as str or bytes.
Returns:
- str: The MD5 digest (32 character hexadecimal string).
Raises:
- TypeError: If input is not str or bytes.
Example::
from gtars.refget import md5_digest
digest = md5_digest("ACGT")
print(digest)
# Output: 'f1f8f4bf413b16ad135722aa4591043e'
digest_fasta
digest_fasta(fasta: Union[str, PathLike]) -> SequenceCollection
Digest all sequences in a FASTA file and compute collection-level digests.
This function reads a FASTA file and computes GA4GH-compliant digests for each sequence, as well as collection-level digests (Level 1 and Level 2) following the GA4GH refget specification.
Parameters:
- fasta (Union[str, PathLike]): Path to FASTA file (str or PathLike).
Returns:
- SequenceCollection: Collection containing all sequences with their metadata and computed digests.
Raises:
- IOError: If the FASTA file cannot be read or parsed.
Example::
from gtars.refget import digest_fasta
collection = digest_fasta("genome.fa")
print(f"Collection digest: {collection.digest}")
print(f"Number of sequences: {len(collection.sequences)}")
compute_fai
compute_fai(fasta: Union[str, PathLike]) -> List[FaiRecord]
Compute FASTA index (FAI) metadata for all sequences in a FASTA file.
This function computes the FAI index metadata (offset, line_bases, line_bytes) for each sequence in a FASTA file, compatible with samtools faidx format. Only works with uncompressed FASTA files.
Parameters:
- fasta (Union[str, PathLike]): Path to FASTA file (str or PathLike). Must be uncompressed.
Returns:
- List[FaiRecord]: List of FAI records, one per sequence, containing name, length, and FAI metadata (offset, line_bases, line_bytes).
Raises:
- IOError: If the FASTA file cannot be read or is compressed.
Example::
from gtars.refget import compute_fai
fai_records = compute_fai("genome.fa")
for record in fai_records:
    print(f"{record.name}: {record.length} bp")
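The FAI fields are byte-level facts about the file layout, which is why they can only be computed from an uncompressed FASTA. The sketch below derives them from FASTA content held in memory, following the samtools faidx meaning of each field. FaiEntry is a hypothetical stand-in for FaiRecord, and fai_from_bytes is not a gtars function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaiEntry:
    """Hypothetical stand-in for FaiRecord."""
    name: str
    length: int      # total number of bases
    offset: int      # byte offset of the first base in the file
    line_bases: int  # bases per sequence line
    line_bytes: int  # bytes per sequence line, including the newline


def fai_from_bytes(fasta: bytes) -> List[FaiEntry]:
    """Compute FAI-style metadata for each record in uncompressed FASTA bytes."""
    entries: List[FaiEntry] = []
    current: Optional[FaiEntry] = None
    pos = 0  # byte position of the start of the current line
    for raw in fasta.splitlines(keepends=True):
        if raw.startswith(b">"):
            fields = raw[1:].split()
            name = fields[0].decode("ascii") if fields else ""
            # The first base sits right after the header line.
            current = FaiEntry(name, 0, pos + len(raw), 0, 0)
            entries.append(current)
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if current.line_bases == 0 and bases > 0:
                current.line_bases = bases
                current.line_bytes = len(raw)
            current.length += bases
        pos += len(raw)
    return entries


fasta = b">chr1 test\nACGT\nACGT\nAC\n>chr2\nGGCC\n"
entries = fai_from_bytes(fasta)
for e in entries:
    print(e.name, e.length, e.offset, e.line_bases, e.line_bytes)
```

With offset, line_bases, and line_bytes known, the byte position of base i in a record is offset + (i // line_bases) * line_bytes + (i % line_bases), which is what makes random access into the file possible.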
load_fasta
load_fasta(fasta: Union[str, PathLike]) -> SequenceCollection
Load a FASTA file with sequence data into a SequenceCollection.
This function reads a FASTA file and loads all sequences with their data into memory. Unlike digest_fasta(), this includes the actual sequence data, not just metadata.
Parameters:
- fasta (Union[str, PathLike]): Path to FASTA file (str or PathLike).
Returns:
- SequenceCollection: Collection containing all sequences with their metadata and sequence data loaded.
Raises:
- IOError: If the FASTA file cannot be read or parsed.
Example::
from gtars.refget import load_fasta
collection = load_fasta("genome.fa")
first_seq = collection.sequences[0]
print(f"Sequence: {first_seq.decode()[:50]}...")
digest_sequence
digest_sequence(name: str, data: bytes, description: Optional[str] = None) -> SequenceRecord
Create a SequenceRecord from raw data, computing all metadata.
This is the sequence-level parallel to digest_fasta() for collections. It computes the GA4GH sha512t24u digest, MD5 digest, detects the alphabet, and returns a SequenceRecord with computed metadata and the original data.
The input data is automatically uppercased to ensure consistent digest computation (matching FASTA processing behavior).
Parameters:
- name (str): The sequence name (e.g., "chr1").
- data (bytes): The raw sequence bytes (e.g., b"ACGTACGT").
- description (Optional[str], default: None): Optional description text for the sequence.
Returns:
- SequenceRecord: A SequenceRecord with computed metadata and the original data (uppercased).
Example::
from gtars.refget import digest_sequence
seq = digest_sequence("chr1", b"ACGTACGT")
print(seq.metadata.name, seq.metadata.length)
# Output: chr1 8
# With description
seq2 = digest_sequence("chr1", b"ACGT", description="Chromosome 1")
print(seq2.metadata.description)
# Output: Chromosome 1