Refget Module API Reference
refget
Type stubs and documentation for the gtars.refget module.
This file serves two purposes:
- Type Hints: Provides type annotations for IDE autocomplete and static type checking tools like mypy.
- Documentation: Contains Google-style docstrings that mkdocstrings uses to generate the API reference documentation website.
Note: The actual implementation is in Rust (gtars-python/src/refget/mod.rs) and compiled via PyO3. This stub file provides the Python interface definition and structured documentation that tools can parse properly.
Classes
AlphabetType
Bases: Enum
Represents the type of alphabet for a sequence.
Attributes
Dna2bit
instance-attribute
Dna2bit: int
Dna3bit
instance-attribute
Dna3bit: int
DnaIupac
instance-attribute
DnaIupac: int
Protein
instance-attribute
Protein: int
Ascii
instance-attribute
Ascii: int
Unknown
instance-attribute
Unknown: int
Functions
__str__
__str__() -> str
FaiMetadata
FASTA index (FAI) metadata for a sequence.
Contains the information needed to quickly seek to a sequence in a FASTA file, compatible with samtools faidx format.
Attributes:
- offset (int): Byte offset of the first base in the FASTA file.
- line_bases (int): Number of bases per line.
- line_bytes (int): Number of bytes per line (including newline).
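As a rough illustration of how these three fields support random access, the sketch below computes the file position of a base using the samtools faidx layout. The function name `byte_position` and the sample values are hypothetical and not part of gtars.

```python
def byte_position(offset: int, line_bases: int, line_bytes: int, pos: int) -> int:
    """Return the file byte offset of the 0-based base `pos`."""
    # Whole lines before the target base, then the position within its line.
    full_lines, within_line = divmod(pos, line_bases)
    return offset + full_lines * line_bytes + within_line


# Hypothetical record: sequence data starts at byte 6, with 60 bases per line
# and 61 bytes per line (60 bases plus one newline byte).
print(byte_position(offset=6, line_bases=60, line_bytes=61, pos=0))   # 6
print(byte_position(offset=6, line_bases=60, line_bytes=61, pos=60))  # 67
```

The difference between `line_bytes` and `line_bases` accounts for the line terminator, which is why both values are stored.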
Attributes
offset
instance-attribute
offset: int
line_bases
instance-attribute
line_bases: int
line_bytes
instance-attribute
line_bytes: int
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
FaiRecord
A FASTA index record for a single sequence.
Represents one line of a .fai index file with sequence name, length, and FAI metadata for random access.
Attributes:
- name (str): Sequence name.
- length (int): Sequence length in bases.
- fai (Optional[FaiMetadata]): FAI metadata (None for gzipped files).
Attributes
name
instance-attribute
name: str
length
instance-attribute
length: int
fai
instance-attribute
fai: Optional[FaiMetadata]
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
SequenceMetadata
Metadata for a biological sequence.
Contains identifying information and computed digests for a sequence, without the actual sequence data.
Attributes:
- name (str): Sequence name (first word of FASTA header).
- description (Optional[str]): Description from FASTA header (text after first whitespace).
- length (int): Length of the sequence in bases.
- sha512t24u (str): GA4GH SHA-512/24u digest (32-char base64url).
- md5 (str): MD5 digest (32-char hex string).
- alphabet (AlphabetType): Detected alphabet type (DNA, protein, etc.).
- fai (Optional[FaiMetadata]): FASTA index metadata if available.
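To make the two digest formats concrete, here is a minimal standard-library sketch of the GA4GH SHA-512/24u scheme (SHA-512, truncated to 24 bytes, base64url-encoded) alongside a plain MD5 hex digest. The helper names are hypothetical, and whether gtars normalizes the sequence (for example, uppercasing) before hashing is not stated here, so treat this as an approximation rather than the library's implementation.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to the first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64 characters, so there is no padding.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def md5_hex(data: bytes) -> str:
    """MD5 digest as a 32-character lowercase hex string."""
    return hashlib.md5(data).hexdigest()


seq = b"ACGT"
print(sha512t24u(seq), len(sha512t24u(seq)))
print(md5_hex(seq), len(md5_hex(seq)))
```

Both digests happen to be 32 characters long, so length alone does not tell them apart.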
Attributes
name
instance-attribute
name: str
description
instance-attribute
description: Optional[str]
length
instance-attribute
length: int
sha512t24u
instance-attribute
sha512t24u: str
md5
instance-attribute
md5: str
alphabet
instance-attribute
alphabet: AlphabetType
fai
instance-attribute
fai: Optional[FaiMetadata]
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
SequenceRecord
A record representing a biological sequence, including its metadata and optional data.
SequenceRecord can be either a "stub" (metadata only) or "full" (metadata + data). Stubs are used for lazy-loading where sequence data is fetched on demand.
Attributes:
- metadata (SequenceMetadata): Sequence metadata (name, length, digests).
- sequence (Optional[bytes]): Raw sequence data if loaded, None for stubs.
- is_loaded (bool): Whether sequence data is loaded (True) or just metadata (False).
Attributes
metadata
instance-attribute
metadata: SequenceMetadata
sequence
instance-attribute
sequence: Optional[bytes]
is_loaded
property
is_loaded: bool
Whether sequence data is loaded (True) or just metadata (False).
Functions
decode
decode() -> Optional[str]
Decode and return the sequence data as a string.
For Full records with sequence data, returns the decoded sequence. For Stub records without sequence data, returns None.
Returns:
- Optional[str]: Decoded sequence string if data is available, None otherwise.
__repr__
__repr__() -> str
__str__
__str__() -> str
SeqColDigestLvl1
Level 1 digests for a sequence collection.
Attributes
sequences_digest
instance-attribute
sequences_digest: str
names_digest
instance-attribute
names_digest: str
lengths_digest
instance-attribute
lengths_digest: str
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
SequenceCollectionMetadata
Metadata for a sequence collection.
Contains the collection digest and level 1 digests for names, sequences, and lengths. This is a lightweight representation of a collection without the actual sequence list.
Attributes:
- digest (str): The collection's SHA-512/24u digest.
- n_sequences (int): Number of sequences in the collection.
- names_digest (str): Level 1 digest of the names array.
- sequences_digest (str): Level 1 digest of the sequences array.
- lengths_digest (str): Level 1 digest of the lengths array.
- name_length_pairs_digest (Optional[str]): Ancillary digest (if computed).
- sorted_name_length_pairs_digest (Optional[str]): Ancillary digest (if computed).
- sorted_sequences_digest (Optional[str]): Ancillary digest (if computed).
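The relationship between the level 1 digests and the collection digest can be sketched as follows, assuming the GA4GH seqcol approach of digesting canonical JSON. Here `canonical_json` uses `json.dumps` as a stand-in for RFC 8785 canonicalization, which is adequate only for simple ASCII strings and integers. The helper names and sample values are hypothetical, and this is not the gtars implementation.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(obj) -> bytes:
    # Stand-in for RFC 8785 (JCS); fine for ASCII strings and integers only.
    return json.dumps(obj, separators=(",", ":"), sort_keys=True).encode("utf-8")


# Hypothetical two-sequence collection.
names = ["chr1", "chr2"]
lengths = [8, 4]
sequences = ["SQ." + sha512t24u(b"ACGTACGT"), "SQ." + sha512t24u(b"ACGT")]

# Level 1: one digest per attribute array.
level1 = {
    "names": sha512t24u(canonical_json(names)),
    "lengths": sha512t24u(canonical_json(lengths)),
    "sequences": sha512t24u(canonical_json(sequences)),
}

# Collection digest: digest of the object holding the level 1 digests.
collection_digest = sha512t24u(canonical_json(level1))
print(level1)
print(collection_digest)
```

Because each level digests the one below it, two collections that share, say, the same names array will share the same names digest even when their sequences differ.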
Attributes
digest
instance-attribute
digest: str
n_sequences
instance-attribute
n_sequences: int
names_digest
instance-attribute
names_digest: str
sequences_digest
instance-attribute
sequences_digest: str
lengths_digest
instance-attribute
lengths_digest: str
name_length_pairs_digest
instance-attribute
name_length_pairs_digest: Optional[str]
sorted_name_length_pairs_digest
instance-attribute
sorted_name_length_pairs_digest: Optional[str]
sorted_sequences_digest
instance-attribute
sorted_sequences_digest: Optional[str]
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str
SequenceCollection
A collection of biological sequences (e.g., a genome assembly).
SequenceCollection represents a set of sequences with collection-level digests following the GA4GH seqcol specification. Supports iteration, indexing, and len().
Attributes:
- sequences (List[SequenceRecord]): List of sequence records.
- digest (str): Collection-level SHA-512/24u digest (Level 2).
- lvl1 (SeqColDigestLvl1): Level 1 digests for names, lengths, sequences.
- file_path (Optional[str]): Source file path if loaded from FASTA.
Examples:
Iterate over sequences::
for seq in collection:
    print(f"{seq.metadata.name}: {seq.metadata.length} bp")
Access by index::
first_seq = collection[0]
last_seq = collection[-1]
Get length::
n = len(collection)
Attributes
sequences
instance-attribute
sequences: List[SequenceRecord]
digest
instance-attribute
digest: str
lvl1
instance-attribute
lvl1: SeqColDigestLvl1
file_path
instance-attribute
file_path: Optional[str]
Functions
write_fasta
write_fasta(file_path: str, line_width: Optional[int] = None) -> None
Write the collection to a FASTA file.
Parameters:
- file_path (str): Path to the output FASTA file.
- line_width (Optional[int], default: None): Number of bases per line (default: 70).
Raises:
- IOError: If any sequence doesn't have data loaded.
Example::
collection = load_fasta("genome.fa")
collection.write_fasta("output.fa")
collection.write_fasta("output.fa", line_width=60)
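For readers unfamiliar with FASTA line wrapping, the sketch below shows what a `line_width` setting means in practice. The helpers `wrap_sequence` and `format_fasta_record` are hypothetical illustrations and do not reflect how gtars writes files internally.

```python
def wrap_sequence(sequence: str, line_width: int = 70) -> list[str]:
    """Split a sequence into lines of at most `line_width` bases."""
    return [sequence[i:i + line_width] for i in range(0, len(sequence), line_width)]


def format_fasta_record(name: str, sequence: str, line_width: int = 70) -> str:
    """Build one FASTA record: a header line followed by wrapped sequence lines."""
    lines = [f">{name}"] + wrap_sequence(sequence, line_width)
    return "\n".join(lines) + "\n"


# A 20-base sequence wrapped at 8 bases per line gives lines of 8, 8, and 4.
print(format_fasta_record("chr1", "ACGT" * 5, line_width=8), end="")
```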
__len__
__len__() -> int
__getitem__
__getitem__(idx: int) -> SequenceRecord
__iter__
__iter__() -> Iterator[SequenceRecord]
__repr__
__repr__() -> str
__str__
__str__() -> str
RetrievedSequence
Represents a retrieved sequence segment with its metadata.
Returned by methods that extract subsequences from specific regions, such as substrings_from_regions().
Attributes:
- sequence (str): The extracted sequence string.
- chrom_name (str): Chromosome/sequence name (e.g., "chr1").
- start (int): Start position (0-based, inclusive).
- end (int): End position (0-based, exclusive).
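The 0-based, half-open convention used by start and end matches Python slicing, as the sketch below shows. `Region` is a hypothetical stand-in with the same four fields as RetrievedSequence, used here only so the example runs without gtars.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical stand-in with the same four fields as RetrievedSequence."""
    sequence: str
    chrom_name: str
    start: int
    end: int


def extract(chrom_name: str, full_sequence: str, start: int, end: int) -> Region:
    # Python slices are also 0-based and half-open, so [start, end) maps directly.
    return Region(full_sequence[start:end], chrom_name, start, end)


region = extract("chr1", "ACGTACGTAC", 2, 6)
print(region.sequence)                                     # GTAC
print(len(region.sequence) == region.end - region.start)   # True
```

With half-open coordinates the region length is simply end minus start, and adjacent regions such as [0, 4) and [4, 8) do not overlap.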
Attributes
sequence
instance-attribute
sequence: str
chrom_name
instance-attribute
chrom_name: str
start
instance-attribute
start: int
end
instance-attribute
end: int
Functions
__init__
__init__(sequence: str, chrom_name: str, start: int, end: int) -> None
__repr__
__repr__() -> str
__str__
__str__() -> str
StorageMode
Bases: Enum
Defines how sequence data is stored in the Refget store.
Variants:
- Raw: Store sequences as raw bytes (1 byte per base).
- Encoded: Store sequences with 2-bit encoding (4 bases per byte).
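To show why 2-bit encoding fits four bases into one byte, here is a minimal packing sketch. The base-to-code mapping, the bit order, and the function names are assumptions made for illustration; the actual gtars encoding may differ, and this sketch handles only A, C, G, and T.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, for illustration only
BASES = "ACGT"


def pack_2bit(sequence: str) -> bytes:
    """Pack bases into bytes, four per byte, first base in the high bits."""
    packed = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a final partial chunk
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


seq = "ACGTACGTAC"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack_2bit(packed, len(seq)) == seq)
```

The sequence length has to be stored separately, since the final byte may contain padding bits.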
Attributes
Raw
instance-attribute
Raw: int
Encoded
instance-attribute
Encoded: int
FhrMetadata
FAIR Headers Reference genome metadata for a sequence collection.
Fields match the FHR 1.0 specification. All fields are optional.
Note: schema_version is a number (int or float) per the spec; it is passed as a Python numeric type and stored internally as serde_json::Number.
Attributes
genome
instance-attribute
genome: Optional[str]
version
instance-attribute
version: Optional[str]
masking
instance-attribute
masking: Optional[str]
genome_synonym
instance-attribute
genome_synonym: Optional[list[str]]
voucher_specimen
instance-attribute
voucher_specimen: Optional[str]
documentation
instance-attribute
documentation: Optional[str]
identifier
instance-attribute
identifier: Optional[list[str]]
scholarly_article
instance-attribute
scholarly_article: Optional[str]
funding
instance-attribute
funding: Optional[str]
Functions
__init__
__init__(**kwargs: Any) -> None
from_json
staticmethod
from_json(path: str) -> FhrMetadata
to_dict
to_dict() -> dict[str, Any]
to_json
to_json(path: str) -> None
__repr__
__repr__() -> str
RefgetStore
A global store for GA4GH refget sequences with lazy-loading support.
RefgetStore provides content-addressable storage for reference genome sequences following the GA4GH refget specification. Supports both local and remote stores with on-demand sequence loading.
Attributes:
- cache_path (Optional[str]): Local directory path where the store is located or cached. None for in-memory stores.
- remote_url (Optional[str]): Remote URL of the store if loaded remotely, None otherwise.
- quiet (bool): Whether the store suppresses progress output.
- storage_mode (StorageMode): Current storage mode (Raw or Encoded).
Note
Boolean evaluation: RefgetStore follows Python container semantics, meaning bool(store) is False for empty stores (like list, dict, etc.). To check whether a store variable is initialized (not None), use if store is not None: rather than if store:.
Example::
store = RefgetStore.in_memory() # Empty store
bool(store) # False (empty container)
len(store) # 0
# Wrong: checks emptiness, not initialization
if store:
    process(store)
# Right: checks if variable is set
if store is not None:
    process(store)
Examples:
Create a new store and import sequences::
from gtars.refget import RefgetStore
store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
Open an existing local store::
store = RefgetStore.open_local("/data/hg38")
seq = store.get_substring("chr1_digest", 0, 1000)
Open a remote store with caching::
store = RefgetStore.open_remote(
    "/local/cache",
    "https://example.com/hg38"
)
Attributes
cache_path
instance-attribute
cache_path: Optional[str]
remote_url
instance-attribute
remote_url: Optional[str]
quiet
property
quiet: bool
Whether the store is in quiet mode.
storage_mode
property
storage_mode: StorageMode
Current storage mode (Raw or Encoded).
is_persisting
property
is_persisting: bool
Whether the store is currently persisting to disk.
Example::
store = RefgetStore.in_memory()
print(store.is_persisting) # False
store.enable_persistence("/data/store")
print(store.is_persisting) # True
Functions
in_memory
classmethod
in_memory() -> RefgetStore
Create a new in-memory RefgetStore.
Creates a store that keeps all sequences in memory. Use this for temporary processing or when you don't need disk persistence.
Returns:
- RefgetStore: New empty RefgetStore with Encoded storage mode.
Example::
store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
store_exists
classmethod
store_exists(path: Union[str, PathLike]) -> bool
Check whether a valid RefgetStore exists at the given path.
Returns True if the path contains a store manifest file, indicating the store has been initialized. Returns False if the path does not exist or does not contain a store.
This avoids hardcoding knowledge of the store's internal file format in calling code.
Parameters:
- path (Union[str, PathLike]): Path to the store directory.
Returns:
- bool: True if a store exists at the path, False otherwise.
Example::
from gtars.refget import RefgetStore
RefgetStore.store_exists("/data/hg38_store") # True
RefgetStore.store_exists("/tmp/empty") # False
on_disk
classmethod
on_disk(cache_path: Union[str, PathLike]) -> RefgetStore
Create or load a disk-backed RefgetStore.
If the directory contains an existing store (rgstore.json), loads it. Otherwise creates a new store with Encoded mode.
Parameters:
- cache_path (Union[str, PathLike]): Directory path for the store. Created if it doesn't exist.
Returns:
- RefgetStore: RefgetStore (new or loaded from disk).
Example::
store = RefgetStore.on_disk("/data/my_store")
store.add_sequence_collection_from_fasta("genome.fa")
# Store is automatically persisted to disk
open_local
classmethod
open_local(path: Union[str, PathLike]) -> RefgetStore
Open a local RefgetStore from a directory.
Loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly accessed with get_collection()/get_sequence().
Expects: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi
Parameters:
- path (Union[str, PathLike]): Local directory containing the refget store.
Returns:
- RefgetStore: RefgetStore with metadata loaded, sequences lazy-loaded.
Raises:
- IOError: If the store directory or index files cannot be read.
Example::
store = RefgetStore.open_local("/data/hg38_store")
seq = store.get_substring("chr1_digest", 0, 1000)
open_remote
classmethod
open_remote(cache_path: Union[str, PathLike], remote_url: str) -> RefgetStore
Open a remote RefgetStore with local caching.
Loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when get_collection()/get_sequence() is called.
By default, persistence is enabled (sequences are cached to disk). Call disable_persistence() after loading to keep sequences in memory only.
Parameters:
- cache_path (Union[str, PathLike]): Local directory to cache downloaded metadata and sequences. Created if it doesn't exist.
- remote_url (str): Base URL of the remote refget store (e.g., "https://example.com/hg38" or "s3://bucket/hg38").
Returns:
- RefgetStore: RefgetStore with metadata loaded, sequences fetched on-demand.
Raises:
- IOError: If remote metadata cannot be fetched or cache cannot be written.
Example::
store = RefgetStore.open_remote(
    "/data/cache/hg38",
    "https://refget-server.com/hg38"
)
# First access fetches from remote and caches
seq = store.get_substring("chr1_digest", 0, 1000)
# Second access uses cache
seq2 = store.get_substring("chr1_digest", 1000, 2000)
set_encoding_mode
set_encoding_mode(mode: StorageMode) -> None
Change the storage mode, re-encoding/decoding existing sequences as needed.
When switching from Raw to Encoded, all Full sequences in memory are encoded (2-bit packed). When switching from Encoded to Raw, all Full sequences in memory are decoded back to raw bytes.
Parameters:
- mode (StorageMode): The storage mode to switch to (StorageMode.Raw or StorageMode.Encoded).
Example::
store = RefgetStore.in_memory()
store.set_encoding_mode(StorageMode.Raw)
enable_encoding
enable_encoding() -> None
Enable 2-bit encoding for space efficiency.
Re-encodes any existing Raw sequences in memory.
Example::
store = RefgetStore.in_memory()
store.disable_encoding() # Switch to Raw
store.enable_encoding() # Back to Encoded
disable_encoding
disable_encoding() -> None
Disable encoding, use raw byte storage.
Decodes any existing Encoded sequences in memory.
Example::
store = RefgetStore.in_memory()
store.disable_encoding() # Switch to Raw mode
set_quiet
set_quiet(quiet: bool) -> None
Set whether to suppress progress output.
When quiet is True, operations like add_sequence_collection_from_fasta will not print progress messages.
Parameters:
- quiet (bool): Whether to suppress progress output.
Example::
store = RefgetStore.in_memory()
store.set_quiet(True)
store.add_sequence_collection_from_fasta("genome.fa") # No output
enable_persistence
enable_persistence(path: Union[str, PathLike]) -> None
Enable disk persistence for this store.
Sets up the store to write sequences to disk. Any in-memory Full sequences are flushed to disk and converted to Stubs.
Parameters:
- path (Union[str, PathLike]): Directory for storing sequences and metadata.
Raises:
- IOError: If the directory cannot be created or written to.
Example::
store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
store.enable_persistence("/data/store") # Flush to disk
disable_persistence
disable_persistence() -> None
Disable disk persistence for this store.
New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if local_path is set.
Example::
store = RefgetStore.open_remote("/cache", "https://example.com")
store.disable_persistence() # Stop caching new sequences
add_sequence_collection_from_fasta
add_sequence_collection_from_fasta(file_path: Union[str, PathLike], force: bool = False, namespaces: Optional[List[str]] = None) -> tuple[SequenceCollectionMetadata, bool]
Add a sequence collection from a FASTA file.
Reads a FASTA file, digests the sequences, creates a SequenceCollection, and adds it to the store along with all its sequences.
Parameters:
- file_path (Union[str, PathLike]): Path to the FASTA file to import.
- force (bool, default: False): If True, overwrite existing collections/sequences. If False (default), skip duplicates.
- namespaces (Optional[List[str]], default: None): Optional list of namespace prefixes to extract aliases from FASTA headers. For example, ["ncbi", "refseq"] will scan headers for tokens like ncbi:NC_000001.11 and register them as aliases.
Returns:
- tuple[SequenceCollectionMetadata, bool]: A tuple containing:
  - SequenceCollectionMetadata: Metadata for the collection.
  - bool: True if the collection was newly added, False if it already existed.
Raises:
- IOError: If the file cannot be read or processed.
Example::
store = RefgetStore.in_memory()
metadata, was_new = store.add_sequence_collection_from_fasta("genome.fa")
print(f"{'Added' if was_new else 'Skipped'}: {metadata.digest}")
# Extract aliases from FASTA headers
metadata, was_new = store.add_sequence_collection_from_fasta(
    "genome.fa", namespaces=["ncbi", "refseq"]
)
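The namespaces option can be pictured as a scan over header tokens. The sketch below assumes whitespace-separated tokens of the form `namespace:alias`; the exact tokenization rules used by gtars are not documented here, and `extract_aliases` is a hypothetical helper, not a library function.

```python
def extract_aliases(header: str, namespaces: list[str]) -> list[tuple[str, str]]:
    """Return (namespace, alias) pairs found in a FASTA header line."""
    wanted = set(namespaces)
    aliases = []
    for token in header.lstrip(">").split():
        namespace, sep, alias = token.partition(":")
        # Keep only well-formed tokens whose namespace was requested.
        if sep and alias and namespace in wanted:
            aliases.append((namespace, alias))
    return aliases


header = ">chr1 ncbi:NC_000001.11 refseq:NC_000001.11 other:ignored"
print(extract_aliases(header, ["ncbi", "refseq"]))
```

Tokens in namespaces that were not requested, such as `other:ignored` above, are skipped.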
add_sequence_collection
add_sequence_collection(collection: SequenceCollection, force: bool = False) -> None
Add a pre-built SequenceCollection to the store.
Adds a SequenceCollection (created via digest_fasta() or programmatically) directly to the store without reading from a FASTA file.
Parameters:
- collection (SequenceCollection): A SequenceCollection to add.
- force (bool, default: False): If True, overwrite existing collections/sequences. If False (default), skip duplicates.
Raises:
- IOError: If the collection cannot be stored.
Example::
from gtars.refget import RefgetStore, digest_fasta
store = RefgetStore.in_memory()
collection = digest_fasta("genome.fa")
store.add_sequence_collection(collection)
add_sequence
add_sequence(sequence: SequenceRecord, force: bool = False) -> None
Add a sequence to the store without collection association.
The sequence can be created using digest_sequence() and later retrieved by its digest via get_sequence().
Parameters:
- sequence (SequenceRecord): A SequenceRecord created by digest_sequence().
- force (bool, default: False): If True, overwrite existing. If False (default), skip duplicates.
Raises:
- IOError: If the sequence cannot be stored.
Example::
from gtars.refget import RefgetStore, digest_sequence
store = RefgetStore.in_memory()
seq = digest_sequence(b"ACGTACGT")
store.add_sequence(seq)
retrieved = store.get_sequence(seq.metadata.sha512t24u)
list_collections
list_collections(page: int = 0, page_size: int = 100, filters: Optional[Dict[str, str]] = None) -> Dict[str, Any]
List collections with pagination and optional attribute filtering.
Parameters:
- page (int, default: 0): 0-indexed page number.
- page_size (int, default: 100): Number of results per page.
- filters (Optional[Dict[str, str]], default: None): Optional attribute filters (AND logic). Keys are attribute names (names, lengths, sequences, name_length_pairs, sorted_name_length_pairs, sorted_sequences), values are digests.
Returns:
- Dict[str, Any]: Dict with "results" (list of SequenceCollectionMetadata) and "pagination" (dict with page, page_size, total).
Example::
# Get first page of all collections
result = store.list_collections()
for meta in result["results"]:
    print(f"{meta.digest}: {meta.n_sequences} sequences")
print(f"Total: {result['pagination']['total']}")
# Filter by names digest
result = store.list_collections(filters={"names": "abc123"})
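The 0-indexed pagination scheme can be illustrated with a plain list. The sketch below mirrors the documented return shape ("results" plus a "pagination" dict with page, page_size, and total); the `paginate` helper and the placeholder digests are hypothetical and are not part of gtars.

```python
import math


def paginate(items: list, page: int = 0, page_size: int = 100) -> dict:
    """Return one 0-indexed page of `items` plus pagination details."""
    start = page * page_size
    return {
        "results": items[start:start + page_size],
        "pagination": {"page": page, "page_size": page_size, "total": len(items)},
    }


digests = [f"digest_{i}" for i in range(250)]  # hypothetical placeholder digests
first = paginate(digests)
n_pages = math.ceil(first["pagination"]["total"] / first["pagination"]["page_size"])
last = paginate(digests, page=n_pages - 1)
print(n_pages, len(first["results"]), len(last["results"]))  # 3 100 50
```

A caller walking every page can compute the page count from total and page_size on the first response, then request pages 1 through n_pages - 1.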
remove_collection
remove_collection(digest: str, remove_orphan_sequences: bool = False) -> bool
Remove a collection from the store.
Parameters:
- digest (str): The collection's SHA-512/24u digest string.
- remove_orphan_sequences (bool, default: False): If True, also remove sequences no longer referenced by any remaining collection. Default: False.
Returns:
- bool: True if the collection was found and removed, False if not found.
get_collection_metadata
get_collection_metadata(collection_digest: str) -> Optional[SequenceCollectionMetadata]
Get metadata for a collection by digest.
Returns lightweight metadata without loading the full collection. Use this for quick lookups of collection information.
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
Returns:
- Optional[SequenceCollectionMetadata]: Collection metadata if found, None otherwise.
Example::
meta = store.get_collection_metadata("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
if meta:
    print(f"Collection has {meta.n_sequences} sequences")
get_collection
get_collection(collection_digest: str) -> SequenceCollection
Get a collection by digest with all sequences loaded.
Loads the collection and all its sequence data into memory. Use this when you need full access to sequence content.
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
Returns:
- SequenceCollection: The collection with all sequence data loaded.
Raises:
- IOError: If the collection cannot be loaded.
Example::
collection = store.get_collection("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
for seq in collection.sequences:
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
iter_collections
iter_collections() -> List[SequenceCollection]
Iterate over all collections with their sequences loaded.
This loads all collection data upfront and returns a list of SequenceCollection objects with full sequence data.
For browsing without loading data, use list_collections() instead.
Returns:
- List[SequenceCollection]: List of all collections with loaded sequences.
Example::
for coll in store.iter_collections():
    print(f"{coll.digest}: {len(coll.sequences)} sequences")
is_collection_loaded
is_collection_loaded(collection_digest: str) -> bool
Check if a collection is fully loaded.
Returns True if the collection's sequence list is loaded in memory, False if it's only metadata (stub).
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
Returns:
- bool: True if loaded, False otherwise.
list_sequences
list_sequences() -> List[SequenceMetadata]
List all sequence metadata in the store.
Returns metadata for all sequences without loading sequence data. Use this for browsing/inventory operations.
Returns:
- List[SequenceMetadata]: List of metadata for all sequences in the store.
Example::
for meta in store.list_sequences():
    print(f"{meta.name}: {meta.length} bp")
get_sequence_metadata
get_sequence_metadata(seq_digest: str) -> Optional[SequenceMetadata]
Get metadata for a sequence by digest (no data loaded).
Use this for lightweight lookups when you don't need the actual sequence. Automatically strips "SQ." prefix from digest if present.
Parameters:
- seq_digest (str): The sequence's SHA-512/24u digest, optionally with "SQ." prefix.
Returns:
- Optional[SequenceMetadata]: Sequence metadata if found, None otherwise.
get_sequence
get_sequence(digest: str) -> SequenceRecord
Retrieve a sequence record by its digest (SHA-512/24u or MD5).
Loads the sequence data if not already in memory. Supports lookup by either SHA-512/24u (preferred) or MD5 digest. Automatically strips "SQ." prefix if present (case-insensitive).
Parameters:
- digest (str): Sequence digest (SHA-512/24u base64url or MD5 hex string), optionally with "SQ." prefix.
Returns:
- SequenceRecord: The sequence record with data.
Raises:
- KeyError: If the sequence is not found.
Example::
record = store.get_sequence("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
print(f"Found: {record.metadata.name}")
# Also works with SQ. prefix
record = store.get_sequence("SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
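The prefix handling described above amounts to a small normalization step before lookup. The sketch below is an assumption about the behavior, not the library's code, and `normalize_digest` is a hypothetical helper name.

```python
def normalize_digest(digest: str) -> str:
    """Drop a leading "SQ." prefix, matched case-insensitively."""
    if digest[:3].upper() == "SQ.":
        return digest[3:]
    return digest


# All three inputs normalize to the same bare digest.
print(normalize_digest("SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))
print(normalize_digest("sq.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))
print(normalize_digest("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))
```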
get_sequence_by_name
get_sequence_by_name(collection_digest: str, sequence_name: str) -> SequenceRecord
Retrieve a sequence by collection digest and sequence name.
Looks up a sequence within a specific collection using its name (e.g., "chr1", "chrM"). Loads the sequence data if needed. Automatically strips "SQ." prefix from collection digest if present.
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest, optionally with "SQ." prefix.
- sequence_name (str): Name of the sequence within that collection.
Returns:
- SequenceRecord: The sequence record with data.
Raises:
- KeyError: If the sequence is not found.
Example::
record = store.get_sequence_by_name(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "chr1"
)
print(f"Sequence: {record.decode()[:50]}...")
iter_sequences
iter_sequences() -> List[SequenceRecord]
Iterate over all sequences with their data loaded.
This ensures all sequence data is loaded and returns a list of SequenceRecord objects with full sequence data.
For browsing without loading data, use list_sequences() instead.
Returns:
- List[SequenceRecord]: List of all sequences with loaded data.
Example::
for seq in store.iter_sequences():
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
get_substring
get_substring(seq_digest: str, start: int, end: int) -> str
Extract a substring from a sequence.
Retrieves a specific region from a sequence using 0-based, half-open coordinates [start, end). Automatically loads sequence data if not already cached (for lazy-loaded stores). Automatically strips "SQ." prefix from digest if present.
Parameters:
- seq_digest (str): Sequence digest (SHA-512/24u), optionally with "SQ." prefix.
- start (int): Start position (0-based, inclusive).
- end (int): End position (0-based, exclusive).
Returns:
- str: The substring sequence.
Raises:
- KeyError: If the sequence is not found.
Example::
# Get first 1000 bases of chr1
seq = store.get_substring("chr1_digest", 0, 1000)
print(f"First 50bp: {seq[:50]}")
stats
stats() -> dict
Returns statistics about the store.
Returns:
- dict: dict with keys:
  - 'n_sequences': Total number of sequences (Stub + Full)
  - 'n_sequences_loaded': Number of sequences with data loaded (Full)
  - 'n_collections': Total number of collections (Stub + Full)
  - 'n_collections_loaded': Number of collections with sequences loaded (Full)
  - 'storage_mode': Storage mode ('Raw' or 'Encoded')
Note
n_collections_loaded only reflects collections fully loaded in memory. For remote stores, collections are loaded on-demand when accessed.
Example::
stats = store.stats()
print(f"Store has {stats['n_sequences']} sequences")
print(f"Collections: {stats['n_collections']} total, {stats['n_collections_loaded']} loaded")
write
write() -> None
Write the store using its configured paths.
Convenience method for disk-backed stores. Uses the store's own local_path and seqdata_path_template.
Raises:
- IOError: If the store cannot be written.
write_store_to_dir
write_store_to_dir(root_path: Union[str, PathLike], seqdata_path_template: Optional[str] = None) -> None
Write the store to a directory on disk.
Persists the store with all sequences and metadata to disk using the RefgetStore directory format.
Parameters:
- root_path (Union[str, PathLike]): Directory path to write the store to.
- seqdata_path_template (Optional[str], default: None): Optional path template for sequence files (e.g., "sequences/%s2/%s.seq" where %s2 = first 2 chars of digest, %s = full digest). Uses default if not specified.
Example::
store.write_store_to_dir("/data/my_store")
store.write_store_to_dir("/data/my_store", "sequences/%s2/%s.seq")
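To show how the two placeholders in seqdata_path_template resolve, here is a small expansion sketch. It follows the description above (%s2 is the first two characters of the digest, %s is the full digest); `expand_template` is a hypothetical helper, and the way gtars performs the substitution internally is an assumption.

```python
def expand_template(template: str, digest: str) -> str:
    """Expand %s2 (first two characters) and %s (full digest)."""
    # Replace the longer placeholder first so "%s2" is not read as "%s" + "2".
    return template.replace("%s2", digest[:2]).replace("%s", digest)


print(expand_template("sequences/%s2/%s.seq", "aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))
```

Sharding by the first two characters keeps any single directory from accumulating every sequence file in a large store.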
get_collection_level1
get_collection_level1(digest: str) -> dict
Get level 1 representation (attribute digests) for a collection.
Parameters:
- digest (str): Collection digest.
Returns:
- dict: dict with spec-compliant field names (names, lengths, sequences, plus optional name_length_pairs, sorted_name_length_pairs, sorted_sequences).
get_collection_level2
get_collection_level2(digest: str) -> dict
Get level 2 representation (full arrays, spec format) for a collection.
Parameters:
- digest (str): Collection digest.
Returns:
- dict: dict with names (list[str]), lengths (list[int]), sequences (list[str]).
compare
compare(digest_a: str, digest_b: str) -> dict
Compare two collections by digest.
Parameters:
- digest_a (str): First collection digest.
- digest_b (str): Second collection digest.
Returns:
- dict: dict with keys: digests, attributes, array_elements.
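The three top-level keys can be pictured with a simplified comparison of two level 2 style dicts. The sub-keys used below (a_only, b_only, a_and_b, a_count, b_count, a_and_b_count) follow the comparison output described in the GA4GH seqcol specification. Whether gtars returns exactly these sub-keys is an assumption, and this sketch omits the order check the specification also defines. `compare_collections` and the sample data are hypothetical.

```python
from collections import Counter


def compare_collections(digest_a: str, a: dict, digest_b: str, b: dict) -> dict:
    """Compare two dicts mapping attribute name -> array."""
    shared = [name for name in a if name in b]
    return {
        "digests": {"a": digest_a, "b": digest_b},
        "attributes": {
            "a_only": [name for name in a if name not in b],
            "b_only": [name for name in b if name not in a],
            "a_and_b": shared,
        },
        "array_elements": {
            "a_count": {name: len(a[name]) for name in shared},
            "b_count": {name: len(b[name]) for name in shared},
            # Multiset intersection: elements present in both arrays.
            "a_and_b_count": {
                name: sum((Counter(a[name]) & Counter(b[name])).values())
                for name in shared
            },
        },
    }


coll_a = {"names": ["chr1", "chr2"], "lengths": [8, 4]}
coll_b = {"names": ["chr2", "chr1", "chrM"], "lengths": [4, 8, 2]}
result = compare_collections("digest_a", coll_a, "digest_b", coll_b)
print(result["attributes"])
print(result["array_elements"])
```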
find_collections_by_attribute
find_collections_by_attribute(attr_name: str, attr_digest: str) -> List[str]
Find collections by attribute digest.
Parameters:
- attr_name (str): Attribute name (names, lengths, sequences, name_length_pairs, sorted_name_length_pairs, sorted_sequences).
- attr_digest (str): The digest to search for.
Returns:
- List[str]: List of collection digests that have the matching attribute.
get_attribute
get_attribute(attr_name: str, attr_digest: str) -> Optional[list]
Get attribute array by digest.
Parameters:
- attr_name (str): Attribute name (names, lengths, or sequences).
- attr_digest (str): The digest to search for.
Returns:
- Optional[list]: The attribute array, or None if not found.
enable_ancillary_digests
enable_ancillary_digests() -> None
Enable computation of ancillary digests.
disable_ancillary_digests
disable_ancillary_digests() -> None
Disable computation of ancillary digests.
has_ancillary_digests
has_ancillary_digests() -> bool
Returns whether ancillary digests are enabled.
has_attribute_index
has_attribute_index() -> bool
Returns whether the on-disk attribute index is enabled.
enable_attribute_index
enable_attribute_index() -> None
Enable indexed attribute lookup (not yet implemented).
disable_attribute_index
disable_attribute_index() -> None
Disable indexed attribute lookup, using brute-force scan instead.
export_fasta_from_regions
export_fasta_from_regions(collection_digest: str, bed_file_path: Union[str, PathLike], output_file_path: Union[str, PathLike]) -> None
Export sequences from BED file regions to a FASTA file.
Reads a BED file defining genomic regions and exports the sequences for those regions to a FASTA file.
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
- bed_file_path (Union[str, PathLike]): Path to BED file defining regions.
- output_file_path (Union[str, PathLike]): Path to write the output FASTA file.
Raises:
- IOError: If files cannot be read/written or sequences not found.
Example::
store.export_fasta_from_regions(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed",
    "output.fa"
)
substrings_from_regions
substrings_from_regions(collection_digest: str, bed_file_path: Union[str, PathLike]) -> List[RetrievedSequence]
Get substrings for BED file regions as a list.
Reads a BED file and returns a list of sequences for each region.
Parameters:
- collection_digest (str): The collection's SHA-512/24u digest.
- bed_file_path (Union[str, PathLike]): Path to BED file defining regions.
Returns:
- List[RetrievedSequence]: List of retrieved sequence segments.
Raises:
- IOError: If files cannot be read or sequences not found.
Example::
sequences = store.substrings_from_regions(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed"
)
for seq in sequences:
    print(f"{seq.chrom_name}:{seq.start}-{seq.end}")
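The region lookup can be pictured with a small pure-Python sketch. It is not the gtars implementation: the in-memory `sequences` dict and the `substrings_from_bed_lines` helper are hypothetical stand-ins, and the sketch assumes standard BED coordinates (0-based start, exclusive end).

```python
from typing import Dict, Iterable, List, Tuple


def substrings_from_bed_lines(
    sequences: Dict[str, str], bed_lines: Iterable[str]
) -> List[Tuple[str, int, int, str]]:
    """Return (chrom, start, end, bases) per BED line (hypothetical helper)."""
    results = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom not in sequences:
            raise KeyError(f"sequence not found: {chrom}")
        # BED intervals are half-open, which matches Python slicing.
        results.append((chrom, start, end, sequences[chrom][start:end]))
    return results


# Toy data standing in for a loaded collection and a BED file.
toy = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGG"}
bed = ["chr1\t0\t4", "chr2\t2\t6"]
regions = substrings_from_bed_lines(toy, bed)
for chrom, start, end, bases in regions:
    print(f"{chrom}:{start}-{end} {bases}")
```

The real method resolves sequence names through the collection digest and reads the BED file from disk; only the coordinate handling is illustrated here.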
export_fasta
export_fasta(collection_digest: str, output_path: Union[str, PathLike], sequence_names: Optional[List[str]] = None, line_width: Optional[int] = None) -> None
Export sequences from a collection to a FASTA file.
Parameters:
-
collection_digest (str) – Collection to export from.
-
output_path (Union[str, PathLike]) – Path to write FASTA file.
-
sequence_names (Optional[List[str]], default: None) – Optional list of sequence names to export. If None, exports all sequences in the collection.
-
line_width (Optional[int], default: None) – Optional line width for wrapping sequences. If None, uses default of 80.
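To show what `line_width` controls, here is a minimal sketch of FASTA line wrapping. The `format_fasta_record` helper is hypothetical and written only for illustration; it is not how gtars writes files internally.

```python
def format_fasta_record(name: str, sequence: str, line_width: int = 80) -> str:
    """Format one FASTA record, wrapping bases at line_width (hypothetical helper)."""
    if line_width <= 0:
        raise ValueError("line_width must be positive")
    lines = [f">{name}"]
    # Emit the sequence in chunks of at most line_width bases.
    for i in range(0, len(sequence), line_width):
        lines.append(sequence[i:i + line_width])
    return "\n".join(lines) + "\n"


record = format_fasta_record("chr1", "ACGTACGTAC", line_width=4)
print(record, end="")
# >chr1
# ACGT
# ACGT
# AC
```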
export_fasta_by_digests
export_fasta_by_digests(seq_digests: List[str], output_path: Union[str, PathLike], line_width: Optional[int] = None) -> None
Export sequences by their digests to a FASTA file.
Parameters:
-
seq_digests (List[str]) – List of sequence digests to export.
-
output_path (Union[str, PathLike]) – Path to write FASTA file.
-
line_width (Optional[int], default: None) – Optional line width for wrapping sequences. If None, uses default of 80.
add_sequence_alias
add_sequence_alias(namespace: str, alias: str, digest: str) -> None
Add a sequence alias: namespace/alias maps to sequence digest.
get_sequence_metadata_by_alias
get_sequence_metadata_by_alias(namespace: str, alias: str) -> Optional[SequenceMetadata]
Resolve a sequence alias to sequence metadata (no data loading).
get_sequence_by_alias
get_sequence_by_alias(namespace: str, alias: str) -> Optional[SequenceRecord]
Resolve a sequence alias and return the loaded sequence record.
Returns None if the alias is not found.
get_aliases_for_sequence
get_aliases_for_sequence(digest: str) -> list[tuple[str, str]]
Reverse lookup: find all (namespace, alias) pairs pointing to this sequence digest.
list_sequence_alias_namespaces
list_sequence_alias_namespaces() -> list[str]
List all sequence alias namespaces.
list_sequence_aliases
list_sequence_aliases(namespace: str) -> Optional[list[str]]
List all aliases in a sequence alias namespace.
remove_sequence_alias
remove_sequence_alias(namespace: str, alias: str) -> bool
Remove a single sequence alias. Returns True if it existed.
load_sequence_aliases
load_sequence_aliases(namespace: str, path: str) -> int
Load sequence aliases from a TSV file (alias\tdigest per line).
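The alias model (namespace/alias maps to a digest, with reverse lookup) can be sketched with a plain dictionary. Everything below is a hypothetical stand-in for the store's alias table, and it assumes the returned integer is the number of aliases loaded, which the reference does not state.

```python
from typing import Dict, Iterable, List, Tuple

# (namespace, alias) -> digest; hypothetical in-memory alias table.
AliasTable = Dict[Tuple[str, str], str]


def load_aliases(table: AliasTable, namespace: str, tsv_lines: Iterable[str]) -> int:
    """Add alias<TAB>digest lines to the table; return how many were added."""
    count = 0
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        alias, digest = line.split("\t", 1)
        table[(namespace, alias)] = digest
        count += 1
    return count


def aliases_for(table: AliasTable, digest: str) -> List[Tuple[str, str]]:
    """Reverse lookup: all (namespace, alias) pairs pointing at digest."""
    return sorted(key for key, value in table.items() if value == digest)


table: AliasTable = {}
loaded = load_aliases(table, "ucsc", ["chr1\tdigestA", "chrM\tdigestB"])
load_aliases(table, "ncbi", ["1\tdigestA"])
print(loaded)                         # 2
print(aliases_for(table, "digestA"))  # [('ncbi', '1'), ('ucsc', 'chr1')]
```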
add_collection_alias
add_collection_alias(namespace: str, alias: str, digest: str) -> None
Add a collection alias: namespace/alias maps to collection digest.
get_collection_metadata_by_alias
get_collection_metadata_by_alias(namespace: str, alias: str) -> Optional[SequenceCollectionMetadata]
Resolve a collection alias to collection metadata (no data loading).
get_collection_by_alias
get_collection_by_alias(namespace: str, alias: str) -> Optional[SequenceCollection]
Resolve a collection alias and return the loaded collection.
Returns None if the alias is not found.
get_aliases_for_collection
get_aliases_for_collection(digest: str) -> list[tuple[str, str]]
Reverse lookup: find all (namespace, alias) pairs pointing to this collection digest.
list_collection_alias_namespaces
list_collection_alias_namespaces() -> list[str]
List all collection alias namespaces.
list_collection_aliases
list_collection_aliases(namespace: str) -> Optional[list[str]]
List all aliases in a collection alias namespace.
remove_collection_alias
remove_collection_alias(namespace: str, alias: str) -> bool
Remove a single collection alias. Returns True if it existed.
load_collection_aliases
load_collection_aliases(namespace: str, path: str) -> int
Load collection aliases from a TSV file (alias\tdigest per line).
set_fhr_metadata
set_fhr_metadata(collection_digest: str, metadata: FhrMetadata) -> None
Set FHR metadata for a collection.
get_fhr_metadata
get_fhr_metadata(collection_digest: str) -> Optional[FhrMetadata]
Get FHR metadata for a collection. Returns None if missing.
remove_fhr_metadata
remove_fhr_metadata(collection_digest: str) -> bool
Remove FHR metadata for a collection.
list_fhr_metadata
list_fhr_metadata() -> list[str]
List all collection digests that have FHR metadata.
load_fhr_metadata
load_fhr_metadata(collection_digest: str, path: str) -> None
Load FHR metadata from a JSON file and attach it to a collection.
into_readonly
into_readonly() -> ReadonlyRefgetStore
Convert to a ReadonlyRefgetStore for concurrent read access.
Consumes this store (replacing it with an empty in-memory store)
and returns a ReadonlyRefgetStore whose read methods all use &self
(no mutable borrow), making it suitable for Arc<ReadonlyRefgetStore>
in servers.
Call load_all_collections() or load_collection() before
converting, since ReadonlyRefgetStore cannot lazy-load.
Returns:
-
ReadonlyRefgetStore – An immutable store suitable for concurrent access.
Example::
store = RefgetStore.open_remote("/cache", "https://example.com")
store.load_all_collections()
readonly = store.into_readonly()
coll = readonly.get_collection("abc123")
__len__
__len__() -> int
__iter__
__iter__() -> Iterator[SequenceMetadata]
__str__
__str__() -> str
__repr__
__repr__() -> str
ReadonlyRefgetStore
An immutable RefgetStore for concurrent read access.
All read methods use immutable references, making this suitable for concurrent access patterns (e.g., shared across threads in a server).
This type has NO write methods and NO constructors -- it is only
obtainable via RefgetStore.into_readonly().
Read methods that require preloaded data (e.g., get_collection())
will error if the data was not loaded before conversion.
Attributes:
-
cache_path (Optional[str]) – Local directory path where the store is located or cached. None for in-memory stores.
-
remote_url (Optional[str]) – Remote URL of the store if loaded remotely, None otherwise.
-
storage_mode (StorageMode) – Current storage mode (Raw or Encoded).
Example::
store = RefgetStore.open_remote("/cache", "https://example.com")
store.load_all_collections()
readonly = store.into_readonly()
coll = readonly.get_collection("abc123")
Attributes
cache_path
instance-attribute
cache_path: Optional[str]
remote_url
instance-attribute
remote_url: Optional[str]
storage_mode
property
storage_mode: StorageMode
Current storage mode (Raw or Encoded).
Functions
list_collections
list_collections(page: int = 0, page_size: int = 100, filters: Optional[Dict[str, str]] = None) -> Dict[str, Any]
List collections with pagination and optional attribute filtering.
get_collection_metadata
get_collection_metadata(collection_digest: str) -> Optional[SequenceCollectionMetadata]
Get metadata for a collection by digest.
get_collection
get_collection(collection_digest: str) -> SequenceCollection
Get a collection by digest with all sequences loaded.
Requires that the collection was preloaded before conversion.
Parameters:
-
collection_digest (str) – The collection's SHA-512/24u digest.
Returns:
-
SequenceCollection – The collection with all sequence data loaded.
Raises:
-
IOError – If the collection was not preloaded.
is_collection_loaded
is_collection_loaded(collection_digest: str) -> bool
Check if a collection is fully loaded.
get_collection_level1
get_collection_level1(digest: str) -> dict
Get level 1 representation (attribute digests) for a collection.
get_collection_level2
get_collection_level2(digest: str) -> dict
Get level 2 representation (full arrays) for a collection.
compare
compare(digest_a: str, digest_b: str) -> dict
Compare two collections by digest.
find_collections_by_attribute
find_collections_by_attribute(attr_name: str, attr_digest: str) -> List[str]
Find collections by attribute digest.
get_attribute
get_attribute(attr_name: str, attr_digest: str) -> Optional[list]
Get attribute array by digest.
has_ancillary_digests
has_ancillary_digests() -> bool
Returns whether ancillary digests are enabled.
has_attribute_index
has_attribute_index() -> bool
Returns whether the on-disk attribute index is enabled.
list_sequences
list_sequences() -> List[SequenceMetadata]
List all sequence metadata in the store.
get_sequence_metadata
get_sequence_metadata(seq_digest: str) -> Optional[SequenceMetadata]
Get metadata for a sequence by digest.
get_sequence
get_sequence(digest: str) -> SequenceRecord
Retrieve a sequence record by its digest.
Parameters:
-
digest (str) – Sequence digest (SHA-512/24u or MD5).
Returns:
-
SequenceRecord – The sequence record with data.
Raises:
-
KeyError – If the sequence is not found.
get_sequence_by_name
get_sequence_by_name(collection_digest: str, sequence_name: str) -> SequenceRecord
Retrieve a sequence by collection digest and sequence name.
Parameters:
-
collection_digest (str) – The collection's SHA-512/24u digest.
-
sequence_name (str) – Name of the sequence within that collection.
Returns:
-
SequenceRecord – The sequence record with data.
Raises:
-
KeyError – If the sequence is not found.
get_substring
get_substring(seq_digest: str, start: int, end: int) -> str
Extract a substring from a sequence.
Parameters:
-
seq_digest (str) – Sequence digest (SHA-512/24u).
-
start (int) – Start position (0-based, inclusive).
-
end (int) – End position (0-based, exclusive).
Returns:
-
str – The substring sequence.
Raises:
-
KeyError – If the sequence is not found.
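Because `start` is inclusive and `end` is exclusive, coordinates taken from 1-based, inclusive tools need converting first. A minimal sketch of that conversion, using a hypothetical `region_to_half_open` helper and a plain string in place of a stored sequence:

```python
from typing import Tuple


def region_to_half_open(start_1based: int, end_1based: int) -> Tuple[int, int]:
    """Convert 1-based inclusive coordinates to 0-based, end-exclusive ones."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the inclusive 1-based end equals the exclusive 0-based end.
    return start_1based - 1, end_1based


sequence = "ACGTACGTAC"
start, end = region_to_half_open(5, 8)  # bases 5 through 8 in 1-based terms
print(start, end, sequence[start:end])  # 4 8 ACGT
```

The resulting pair has the same shape as the `start` and `end` arguments described above, so the substring length is always `end - start`.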
stats
stats() -> dict
Returns statistics about the store.
get_sequence_metadata_by_alias
get_sequence_metadata_by_alias(namespace: str, alias: str) -> Optional[SequenceMetadata]
Resolve a sequence alias to sequence metadata.
get_sequence_by_alias
get_sequence_by_alias(namespace: str, alias: str) -> Optional[SequenceRecord]
Resolve a sequence alias and return the loaded sequence record.
get_aliases_for_sequence
get_aliases_for_sequence(digest: str) -> list[tuple[str, str]]
Reverse lookup: find all (namespace, alias) pairs for this sequence.
list_sequence_alias_namespaces
list_sequence_alias_namespaces() -> list[str]
List all sequence alias namespaces.
list_sequence_aliases
list_sequence_aliases(namespace: str) -> Optional[list[str]]
List all aliases in a sequence alias namespace.
get_collection_metadata_by_alias
get_collection_metadata_by_alias(namespace: str, alias: str) -> Optional[SequenceCollectionMetadata]
Resolve a collection alias to collection metadata.
get_collection_by_alias
get_collection_by_alias(namespace: str, alias: str) -> Optional[SequenceCollection]
Resolve a collection alias and return the loaded collection.
get_aliases_for_collection
get_aliases_for_collection(digest: str) -> list[tuple[str, str]]
Reverse lookup: find all (namespace, alias) pairs for this collection.
list_collection_alias_namespaces
list_collection_alias_namespaces() -> list[str]
List all collection alias namespaces.
list_collection_aliases
list_collection_aliases(namespace: str) -> Optional[list[str]]
List all aliases in a collection alias namespace.
get_fhr_metadata
get_fhr_metadata(collection_digest: str) -> Optional[FhrMetadata]
Get FHR metadata for a collection.
list_fhr_metadata
list_fhr_metadata() -> list[str]
List all collection digests that have FHR metadata.
__len__
__len__() -> int
__str__
__str__() -> str
__repr__
__repr__() -> str
Functions
sha512t24u_digest
sha512t24u_digest(readable: Union[str, bytes]) -> str
Compute the GA4GH SHA-512/24u digest for a sequence.
This function computes the GA4GH refget standard digest (truncated SHA-512, base64url encoded) for a given sequence string or bytes.
Parameters:
-
readable (Union[str, bytes]) – Input sequence as str or bytes.
Returns:
-
str – The SHA-512/24u digest (32 character base64url string).
Raises:
-
TypeError – If input is not str or bytes.
Example::
from gtars.refget import sha512t24u_digest
digest = sha512t24u_digest("ACGT")
print(digest)
# Output: 'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
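For readers who want to see what "truncated SHA-512, base64url encoded" means, the digest can be reproduced with the standard library. This is an illustrative sketch, not the Rust implementation; `sha512t24u_sketch` is a hypothetical name, and encoding str input as UTF-8 is an assumption.

```python
import base64
import hashlib
from typing import Union


def sha512t24u_sketch(readable: Union[str, bytes]) -> str:
    """SHA-512, truncated to 24 bytes, base64url encoded (illustrative only)."""
    if isinstance(readable, str):
        data = readable.encode("utf-8")  # assumption: str input is encoded as UTF-8
    elif isinstance(readable, bytes):
        data = readable
    else:
        raise TypeError("expected str or bytes")
    truncated = hashlib.sha512(data).digest()[:24]
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


print(sha512t24u_sketch("ACGT"))  # aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2
```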
md5_digest
md5_digest(readable: Union[str, bytes]) -> str
Compute the MD5 digest for a sequence.
This function computes the MD5 hash for a given sequence string or bytes. MD5 is supported for backward compatibility with legacy systems.
Parameters:
-
readable (Union[str, bytes]) – Input sequence as str or bytes.
Returns:
-
str – The MD5 digest (32 character hexadecimal string).
Raises:
-
TypeError – If input is not str or bytes.
Example::
from gtars.refget import md5_digest
digest = md5_digest("ACGT")
print(digest)
# Output: 'f1f8f4bf413b16ad135722aa4591043e'
digest_fasta
digest_fasta(fasta: Union[str, PathLike]) -> SequenceCollection
Digest all sequences in a FASTA file and compute collection-level digests.
This function reads a FASTA file and computes GA4GH-compliant digests for each sequence, as well as collection-level digests (Level 1 and Level 2) following the GA4GH refget specification.
Parameters:
-
fasta (Union[str, PathLike]) – Path to FASTA file (str or PathLike).
Returns:
-
SequenceCollection – Collection containing all sequences with their metadata and computed digests.
Raises:
-
IOError – If the FASTA file cannot be read or parsed.
Example::
from gtars.refget import digest_fasta
collection = digest_fasta("genome.fa")
print(f"Collection digest: {collection.digest}")
print(f"Number of sequences: {len(collection)}")
compute_fai
compute_fai(fasta: Union[str, PathLike]) -> List[FaiRecord]
Compute FASTA index (FAI) metadata for all sequences in a FASTA file.
This function computes the FAI index metadata (offset, line_bases, line_bytes) for each sequence in a FASTA file, compatible with samtools faidx format. Only works with uncompressed FASTA files.
Parameters:
-
fasta (Union[str, PathLike]) – Path to FASTA file (str or PathLike). Must be uncompressed.
Returns:
-
List[FaiRecord] – List of FAI records, one per sequence, containing name, length, and FAI metadata (offset, line_bases, line_bytes).
Raises:
-
IOError – If the FASTA file cannot be read or is compressed.
Example::
from gtars.refget import compute_fai
fai_records = compute_fai("genome.fa")
for record in fai_records:
    print(f"{record.name}: {record.length} bp")
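The meaning of offset, line_bases, and line_bytes is easiest to see by computing them by hand. The sketch below does this in pure Python for a small uncompressed FASTA file. `FaiRow` and `fai_sketch` are hypothetical stand-ins for `FaiRecord` and `compute_fai`, and the sketch assumes well-formed input with non-empty header names.

```python
import os
import tempfile
from typing import List, NamedTuple


class FaiRow(NamedTuple):  # hypothetical stand-in for FaiRecord
    name: str
    length: int
    offset: int
    line_bases: int
    line_bytes: int


def fai_sketch(path: str) -> List[FaiRow]:
    """Compute samtools-style FAI fields for an uncompressed FASTA file."""
    rows: List[FaiRow] = []
    name = None
    length = offset = line_bases = line_bytes = 0
    position = 0  # byte position of the start of the current line
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                if name is not None:
                    rows.append(FaiRow(name, length, offset, line_bases, line_bytes))
                # The name is the header text up to the first whitespace.
                name = raw[1:].split()[0].decode("ascii")
                length = line_bases = line_bytes = 0
                offset = position + len(raw)  # first base follows the header line
            elif name is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if line_bases == 0:
                    # The first sequence line fixes the line geometry.
                    line_bases, line_bytes = bases, len(raw)
                length += bases
            position += len(raw)
    if name is not None:
        rows.append(FaiRow(name, length, offset, line_bases, line_bytes))
    return rows


# Write a tiny FASTA file with "\n" line endings, index it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False, newline="\n") as tmp:
    tmp.write(">chr1 test\nACGT\nAC\n>chr2\nGGG\n")
rows = fai_sketch(tmp.name)
os.remove(tmp.name)
for row in rows:
    print(row)
```

With these fields, the byte position of base `i` (0-based) in a sequence is `offset + (i // line_bases) * line_bytes + i % line_bases`, which is what makes random access possible.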
load_fasta
load_fasta(fasta: Union[str, PathLike]) -> SequenceCollection
Load a FASTA file with sequence data into a SequenceCollection.
This function reads a FASTA file and loads all sequences with their data into memory. Unlike digest_fasta(), this includes the actual sequence data, not just metadata.
Parameters:
-
fasta (Union[str, PathLike]) – Path to FASTA file (str or PathLike).
Returns:
-
SequenceCollection – Collection containing all sequences with their metadata and sequence data loaded.
Raises:
-
IOError – If the FASTA file cannot be read or parsed.
Example::
from gtars.refget import load_fasta
collection = load_fasta("genome.fa")
first_seq = collection[0]
print(f"Sequence: {first_seq.decode()[:50]}...")
digest_sequence
digest_sequence(data: bytes, name: Optional[str] = None, description: Optional[str] = None) -> SequenceRecord
Create a SequenceRecord from raw data, computing all metadata.
This is the sequence-level parallel to digest_fasta() for collections. It computes the GA4GH sha512t24u digest, MD5 digest, detects the alphabet, and returns a SequenceRecord with computed metadata and the original data.
The input data is automatically uppercased to ensure consistent digest computation (matching FASTA processing behavior).
Parameters:
-
data (bytes) – The raw sequence bytes (e.g., b"ACGTACGT").
-
name (Optional[str], default: None) – Optional sequence name (e.g., "chr1"). Defaults to "" if not provided.
-
description (Optional[str], default: None) – Optional description text for the sequence.
Returns:
-
SequenceRecord – A SequenceRecord with computed metadata and the original data (uppercased).
Example::
from gtars.refget import digest_sequence
seq = digest_sequence(b"ACGTACGT")
print(seq.metadata.length)
# Output: 8
seq = digest_sequence(b"ACGT", name="chr1")
print(seq.metadata.name, seq.metadata.length)
# Output: chr1 4
# With description
seq2 = digest_sequence(b"ACGT", name="chr1", description="Chromosome 1")
print(seq2.metadata.description)
# Output: Chromosome 1
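The effect of uppercasing before digesting can be checked with a small standard-library sketch. `digest_sequence_sketch` is a hypothetical helper that returns a plain dict rather than a `SequenceRecord`, and it leaves out alphabet detection.

```python
import base64
import hashlib


def digest_sequence_sketch(data: bytes, name: str = "") -> dict:
    """Uppercase the bases, then compute length and both digests (illustrative)."""
    normalized = data.upper()
    sha512 = hashlib.sha512(normalized).digest()[:24]
    return {
        "name": name,
        "length": len(normalized),
        "sha512t24u": base64.urlsafe_b64encode(sha512).decode("ascii"),
        "md5": hashlib.md5(normalized).hexdigest(),
        "data": normalized,
    }


lower = digest_sequence_sketch(b"acgt", name="chr1")
upper = digest_sequence_sketch(b"ACGT", name="chr1")
# Soft-masked (lowercase) and uppercase input yield the same digests.
print(lower["sha512t24u"] == upper["sha512t24u"])  # True
print(lower["length"], lower["data"])  # 4 b'ACGT'
```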