Refget Module API Reference

refget

Type stubs and documentation for the gtars.refget module.

This file serves two purposes:

  1. Type Hints: Provides type annotations for IDE autocomplete and static type checking tools like mypy.

  2. Documentation: Contains Google-style docstrings that mkdocstrings uses to generate the API reference documentation website.

Note: The actual implementation is in Rust (gtars-python/src/refget/mod.rs) and compiled via PyO3. This stub file provides the Python interface definition and structured documentation that tools can parse properly.

Classes

AlphabetType

Bases: Enum

Represents the type of alphabet for a sequence.

Attributes
Dna2bit instance-attribute
Dna2bit: int
Dna3bit instance-attribute
Dna3bit: int
DnaIupac instance-attribute
DnaIupac: int
Protein instance-attribute
Protein: int
Ascii instance-attribute
Ascii: int
Unknown instance-attribute
Unknown: int
Functions
__str__
__str__() -> str

FaiMetadata

FASTA index (FAI) metadata for a sequence.

Contains the information needed to quickly seek to a sequence in a FASTA file, compatible with samtools faidx format.

Attributes:

  • offset (int) –

    Byte offset of the first base in the FASTA file.

  • line_bases (int) –

    Number of bases per line.

  • line_bytes (int) –

    Number of bytes per line (including newline).
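The three fields combine into a byte position for any base, which is what makes faidx-style random access possible. The following is a minimal sketch, assuming an uncompressed FASTA with uniform line lengths; `base_byte_position` is a hypothetical helper for illustration and is not part of gtars.

```python
def base_byte_position(offset: int, line_bases: int, line_bytes: int, pos: int) -> int:
    """Return the file byte offset of the 0-based base `pos` within a sequence."""
    full_lines, within_line = divmod(pos, line_bases)
    # Each complete line costs line_bytes (the bases plus the newline characters).
    return offset + full_lines * line_bytes + within_line


# Sequence data starts at byte 6, with 60 bases plus "\n" per line.
print(base_byte_position(offset=6, line_bases=60, line_bytes=61, pos=0))   # 6
print(base_byte_position(offset=6, line_bases=60, line_bytes=61, pos=59))  # 65
print(base_byte_position(offset=6, line_bases=60, line_bytes=61, pos=60))  # 67
```

The difference between line_bytes and line_bases is the newline width (1 for "\n", 2 for "\r\n"), which is why both values are stored.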

Attributes
offset instance-attribute
offset: int
line_bases instance-attribute
line_bases: int
line_bytes instance-attribute
line_bytes: int
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

FaiRecord

A FASTA index record for a single sequence.

Represents one line of a .fai index file with sequence name, length, and FAI metadata for random access.

Attributes:

  • name (str) –

    Sequence name.

  • length (int) –

    Sequence length in bases.

  • fai (Optional[FaiMetadata]) –

    FAI metadata (None for gzipped files).

Attributes
name instance-attribute
name: str
length instance-attribute
length: int
fai instance-attribute
fai: Optional[FaiMetadata]
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

SequenceMetadata

Metadata for a biological sequence.

Contains identifying information and computed digests for a sequence, without the actual sequence data.

Attributes:

  • name (str) –

    Sequence name (first word of FASTA header).

  • description (Optional[str]) –

    Description from FASTA header (text after first whitespace).

  • length (int) –

    Length of the sequence in bases.

  • sha512t24u (str) –

    GA4GH SHA-512/24u digest (32-char base64url).

  • md5 (str) –

    MD5 digest (32-char hex string).

  • alphabet (AlphabetType) –

    Detected alphabet type (DNA, protein, etc.).

  • fai (Optional[FaiMetadata]) –

    FASTA index metadata if available.
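Both digest fields can be reproduced with the Python standard library. The sketch below follows the GA4GH definition of sha512t24u (SHA-512, truncated to 24 bytes, base64url-encoded); the helper names are illustrative, and any normalization gtars applies to the sequence before hashing (for example uppercasing) is not shown.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """GA4GH truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def md5_hex(data: bytes) -> str:
    """MD5 digest as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


seq = b"ACGT"
print(sha512t24u(seq), len(sha512t24u(seq)))  # digest is 32 characters long
print(md5_hex(seq), len(md5_hex(seq)))        # digest is 32 characters long
```

Because 24 bytes encode to exactly 32 base64 characters, the sha512t24u value never carries "=" padding.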

Attributes
name instance-attribute
name: str
description instance-attribute
description: Optional[str]
length instance-attribute
length: int
sha512t24u instance-attribute
sha512t24u: str
md5 instance-attribute
md5: str
alphabet instance-attribute
alphabet: AlphabetType
fai instance-attribute
fai: Optional[FaiMetadata]
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

SequenceRecord

A record representing a biological sequence, including its metadata and optional data.

SequenceRecord can be either a "stub" (metadata only) or "full" (metadata + data). Stubs are used for lazy-loading where sequence data is fetched on demand.

Attributes:

  • metadata (SequenceMetadata) –

    Sequence metadata (name, length, digests).

  • sequence (Optional[bytes]) –

    Raw sequence data if loaded, None for stubs.

  • is_loaded (bool) –

    Whether sequence data is loaded (True) or just metadata (False).

Attributes
metadata instance-attribute
metadata: SequenceMetadata
sequence instance-attribute
sequence: Optional[bytes]
is_loaded property
is_loaded: bool

Whether sequence data is loaded (True) or just metadata (False).

Functions
decode
decode() -> Optional[str]

Decode and return the sequence data as a string.

For Full records with sequence data, returns the decoded sequence. For Stub records without sequence data, returns None.

Returns:

  • Optional[str] –

    Decoded sequence string if data is available, None otherwise.

__repr__
__repr__() -> str
__str__
__str__() -> str

SeqColDigestLvl1

Level 1 digests for a sequence collection.

Attributes
sequences_digest instance-attribute
sequences_digest: str
names_digest instance-attribute
names_digest: str
lengths_digest instance-attribute
lengths_digest: str
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

SequenceCollectionMetadata

Metadata for a sequence collection.

Contains the collection digest and level 1 digests for names, sequences, and lengths. This is a lightweight representation of a collection without the actual sequence list.

Attributes
digest instance-attribute
digest: str
n_sequences instance-attribute
n_sequences: int
names_digest instance-attribute
names_digest: str
sequences_digest instance-attribute
sequences_digest: str
lengths_digest instance-attribute
lengths_digest: str
name_length_pairs_digest instance-attribute
name_length_pairs_digest: Optional[str]
sorted_name_length_pairs_digest instance-attribute
sorted_name_length_pairs_digest: Optional[str]
sorted_sequences_digest instance-attribute
sorted_sequences_digest: Optional[str]
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

SequenceCollection

A collection of biological sequences (e.g., a genome assembly).

SequenceCollection represents a set of sequences with collection-level digests following the GA4GH seqcol specification. Supports iteration, indexing, and len().

Examples:

Iterate over sequences::

for seq in collection:
    print(f"{seq.metadata.name}: {seq.metadata.length} bp")

Access by index::

first_seq = collection[0]
last_seq = collection[-1]

Get length::

n = len(collection)
Attributes
sequences instance-attribute
sequences: List[SequenceRecord]
digest instance-attribute
digest: str
lvl1 instance-attribute
lvl1: SeqColDigestLvl1
file_path instance-attribute
file_path: Optional[str]
Functions
write_fasta
write_fasta(file_path: str, line_width: Optional[int] = None) -> None

Write the collection to a FASTA file.

Parameters:

  • file_path (str) –

    Path to the output FASTA file.

  • line_width (Optional[int], default: None ) –

    Number of bases per line. If None, defaults to 70.

Raises:

  • IOError –

    If any sequence doesn't have data loaded.

Example::

collection = load_fasta("genome.fa")
collection.write_fasta("output.fa")
collection.write_fasta("output.fa", line_width=60)
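To make the effect of line_width concrete, here is a small sketch of FASTA line wrapping in plain Python. The functions `wrap_sequence` and `format_fasta_record` are hypothetical helpers for illustration, not gtars APIs, and the real writer may differ in details such as header formatting.

```python
from typing import Iterator


def wrap_sequence(seq: str, line_width: int) -> Iterator[str]:
    """Yield consecutive slices of `seq`, each at most `line_width` characters."""
    for i in range(0, len(seq), line_width):
        yield seq[i:i + line_width]


def format_fasta_record(name: str, seq: str, line_width: int = 70) -> str:
    """Build one FASTA record: a header line followed by wrapped sequence lines."""
    lines = [f">{name}", *wrap_sequence(seq, line_width)]
    return "\n".join(lines) + "\n"


print(format_fasta_record("chr1", "ACGT" * 5, line_width=8), end="")
# >chr1
# ACGTACGT
# ACGTACGT
# ACGT
```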
__len__
__len__() -> int
__getitem__
__getitem__(idx: int) -> SequenceRecord
__iter__
__iter__() -> Iterator[SequenceRecord]
__repr__
__repr__() -> str
__str__
__str__() -> str

RetrievedSequence

Represents a retrieved sequence segment with its metadata.

Returned by methods that extract subsequences from specific regions, such as substrings_from_regions().

Attributes:

  • sequence (str) –

    The extracted sequence string.

  • chrom_name (str) –

    Chromosome/sequence name (e.g., "chr1").

  • start (int) –

    Start position (0-based, inclusive).

  • end (int) –

    End position (0-based, exclusive).

Attributes
sequence instance-attribute
sequence: str
chrom_name instance-attribute
chrom_name: str
start instance-attribute
start: int
end instance-attribute
end: int
Functions
__init__
__init__(sequence: str, chrom_name: str, start: int, end: int) -> None
__repr__
__repr__() -> str
__str__
__str__() -> str

StorageMode

Bases: Enum

Defines how sequence data is stored in the Refget store.

Variants

  • Raw –

    Store sequences as raw bytes (1 byte per base).

  • Encoded –

    Store sequences with 2-bit encoding (4 bases per byte).
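The space saving of 2-bit encoding comes from packing four bases into each byte. The sketch below shows the general technique for an ACGT-only sequence; the code table and bit order are assumptions chosen for illustration, and the layout gtars uses internally may differ.

```python
# Illustrative mapping only; the bit layout used by gtars may differ.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an ACGT string into bytes, 4 bases per byte (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Inverse of pack_2bit; `length` is needed to drop the padding."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


seq = "ACGTACGTAC"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")  # 10 bases -> 3 bytes
print(unpack_2bit(packed, len(seq)) == seq)        # True
```

A 2-bit code can only represent four symbols, so sequences containing other characters (such as N or IUPAC codes) need a wider alphabet or raw storage.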

Attributes
Raw instance-attribute
Raw: int
Encoded instance-attribute
Encoded: int

FhrMetadata

FAIR Headers Reference genome metadata for a sequence collection.

Fields match the FHR 1.0 specification. All fields are optional. Note: schema_version is a number (int or float) per spec, passed as a Python numeric type and stored as serde_json::Number internally.

Attributes
genome instance-attribute
genome: Optional[str]
version instance-attribute
version: Optional[str]
masking instance-attribute
masking: Optional[str]
genome_synonym instance-attribute
genome_synonym: Optional[list[str]]
voucher_specimen instance-attribute
voucher_specimen: Optional[str]
documentation instance-attribute
documentation: Optional[str]
identifier instance-attribute
identifier: Optional[list[str]]
scholarly_article instance-attribute
scholarly_article: Optional[str]
funding instance-attribute
funding: Optional[str]
Functions
__init__
__init__(**kwargs: Any) -> None
from_json staticmethod
from_json(path: str) -> FhrMetadata
to_dict
to_dict() -> dict[str, Any]
to_json
to_json(path: str) -> None
__repr__
__repr__() -> str

RefgetStore

A global store for GA4GH refget sequences with lazy-loading support.

RefgetStore provides content-addressable storage for reference genome sequences following the GA4GH refget specification. Supports both local and remote stores with on-demand sequence loading.

Attributes:

  • cache_path (Optional[str]) –

    Local directory path where the store is located or cached. None for in-memory stores.

  • remote_url (Optional[str]) –

    Remote URL of the store if loaded remotely, None otherwise.

  • quiet (bool) –

    Whether the store suppresses progress output.

  • storage_mode (StorageMode) –

    Current storage mode (Raw or Encoded).

Note

Boolean evaluation: RefgetStore follows Python container semantics, meaning bool(store) is False for empty stores (like list, dict, etc.). To check if a store variable is initialized (not None), use if store is not None: rather than if store:.

Example::

store = RefgetStore.in_memory()  # Empty store
bool(store)  # False (empty container)
len(store)   # 0

# Wrong: checks emptiness, not initialization
if store:
    process(store)

# Right: checks if variable is set
if store is not None:
    process(store)
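The behavior described in this note is standard Python: when a class defines `__len__` but not `__bool__`, truth testing falls back to the length. The sketch below demonstrates this with a hypothetical `TinyStore` class; it is not the gtars implementation, and how RefgetStore implements truthiness internally is not stated here.

```python
class TinyStore:
    """Hypothetical stand-in for a container-like store; not the gtars class."""

    def __init__(self) -> None:
        self._items = []

    def add(self, item: str) -> None:
        self._items.append(item)

    def __len__(self) -> int:
        return len(self._items)


store = TinyStore()
print(bool(store), store is not None)  # False True
store.add("chr1")
print(bool(store), store is not None)  # True True
```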

Examples:

Create a new store and import sequences::

from gtars.refget import RefgetStore
store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")

Open an existing local store::

store = RefgetStore.open_local("/data/hg38")
seq = store.get_substring("chr1_digest", 0, 1000)

Open a remote store with caching::

store = RefgetStore.open_remote(
    "/local/cache",
    "https://example.com/hg38"
)
Attributes
cache_path instance-attribute
cache_path: Optional[str]
remote_url instance-attribute
remote_url: Optional[str]
quiet property
quiet: bool

Whether the store is in quiet mode.

storage_mode property
storage_mode: StorageMode

Current storage mode (Raw or Encoded).

is_persisting property
is_persisting: bool

Whether the store is currently persisting to disk.

Example::

store = RefgetStore.in_memory()
print(store.is_persisting)  # False
store.enable_persistence("/data/store")
print(store.is_persisting)  # True
Functions
in_memory classmethod
in_memory() -> RefgetStore

Create a new in-memory RefgetStore.

Creates a store that keeps all sequences in memory. Use this for temporary processing or when you don't need disk persistence.

Returns:

  • RefgetStore –

    New empty RefgetStore with Encoded storage mode.

Example::

store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
store_exists classmethod
store_exists(path: Union[str, PathLike]) -> bool

Check whether a valid RefgetStore exists at the given path.

Returns True if the path contains a store manifest file, indicating the store has been initialized. Returns False if the path does not exist or does not contain a store.

This avoids hardcoding knowledge of the store's internal file format in calling code.

Parameters:

  • path (Union[str, PathLike]) –

    Path to the store directory.

Returns:

  • bool –

    True if a store exists at the path, False otherwise.

Example::

from gtars.refget import RefgetStore
RefgetStore.store_exists("/data/hg38_store")  # True
RefgetStore.store_exists("/tmp/empty")  # False
on_disk classmethod
on_disk(cache_path: Union[str, PathLike]) -> RefgetStore

Create or load a disk-backed RefgetStore.

If the directory contains an existing store (rgstore.json), loads it. Otherwise creates a new store with Encoded mode.

Parameters:

  • cache_path (Union[str, PathLike]) –

    Directory path for the store. Created if it doesn't exist.

Returns:

  • RefgetStore –

    RefgetStore (new or loaded from disk).

Example::

store = RefgetStore.on_disk("/data/my_store")
store.add_sequence_collection_from_fasta("genome.fa")
# Store is automatically persisted to disk
open_local classmethod
open_local(path: Union[str, PathLike]) -> RefgetStore

Open a local RefgetStore from a directory.

Loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly accessed with get_collection()/get_sequence().

Expects: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi

Parameters:

  • path (Union[str, PathLike]) –

    Local directory containing the refget store.

Returns:

  • RefgetStore –

    RefgetStore with metadata loaded, sequences lazy-loaded.

Raises:

  • IOError –

    If the store directory or index files cannot be read.

Example::

store = RefgetStore.open_local("/data/hg38_store")
seq = store.get_substring("chr1_digest", 0, 1000)
open_remote classmethod
open_remote(cache_path: Union[str, PathLike], remote_url: str) -> RefgetStore

Open a remote RefgetStore with local caching.

Loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when get_collection()/get_sequence() is called.

By default, persistence is enabled (sequences are cached to disk). Call disable_persistence() after loading to keep only in memory.

Parameters:

  • cache_path (Union[str, PathLike]) –

    Local directory to cache downloaded metadata and sequences. Created if it doesn't exist.

  • remote_url (str) –

    Base URL of the remote refget store (e.g., "https://example.com/hg38" or "s3://bucket/hg38").

Returns:

  • RefgetStore –

    RefgetStore with metadata loaded, sequences fetched on-demand.

Raises:

  • IOError –

    If remote metadata cannot be fetched or cache cannot be written.

Example::

store = RefgetStore.open_remote(
    "/data/cache/hg38",
    "https://refget-server.com/hg38"
)
# First access fetches from remote and caches
seq = store.get_substring("chr1_digest", 0, 1000)
# Second access uses cache
seq2 = store.get_substring("chr1_digest", 1000, 2000)
set_encoding_mode
set_encoding_mode(mode: StorageMode) -> None

Change the storage mode, re-encoding/decoding existing sequences as needed.

When switching from Raw to Encoded, all Full sequences in memory are encoded (2-bit packed). When switching from Encoded to Raw, all Full sequences in memory are decoded back to raw bytes.

Parameters:

  • mode (StorageMode) –

    The storage mode to switch to (StorageMode.Raw or StorageMode.Encoded).

Example::

store = RefgetStore.in_memory()
store.set_encoding_mode(StorageMode.Raw)
enable_encoding
enable_encoding() -> None

Enable 2-bit encoding for space efficiency.

Re-encodes any existing Raw sequences in memory.

Example::

store = RefgetStore.in_memory()
store.disable_encoding()  # Switch to Raw
store.enable_encoding()   # Back to Encoded
disable_encoding
disable_encoding() -> None

Disable encoding, use raw byte storage.

Decodes any existing Encoded sequences in memory.

Example::

store = RefgetStore.in_memory()
store.disable_encoding()  # Switch to Raw mode
set_quiet
set_quiet(quiet: bool) -> None

Set whether to suppress progress output.

When quiet is True, operations like add_sequence_collection_from_fasta will not print progress messages.

Parameters:

  • quiet (bool) –

    Whether to suppress progress output.

Example::

store = RefgetStore.in_memory()
store.set_quiet(True)
store.add_sequence_collection_from_fasta("genome.fa")  # No output
enable_persistence
enable_persistence(path: Union[str, PathLike]) -> None

Enable disk persistence for this store.

Sets up the store to write sequences to disk. Any in-memory Full sequences are flushed to disk and converted to Stubs.

Parameters:

  • path (Union[str, PathLike]) –

    Directory for storing sequences and metadata.

Raises:

  • IOError –

    If the directory cannot be created or written to.

Example::

store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
store.enable_persistence("/data/store")  # Flush to disk
disable_persistence
disable_persistence() -> None

Disable disk persistence for this store.

New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if local_path is set.

Example::

store = RefgetStore.open_remote("/cache", "https://example.com")
store.disable_persistence()  # Stop caching new sequences
add_sequence_collection_from_fasta
add_sequence_collection_from_fasta(file_path: Union[str, PathLike], force: bool = False, namespaces: Optional[List[str]] = None) -> tuple[SequenceCollectionMetadata, bool]

Add a sequence collection from a FASTA file.

Reads a FASTA file, digests the sequences, creates a SequenceCollection, and adds it to the store along with all its sequences.

Parameters:

  • file_path (Union[str, PathLike]) –

    Path to the FASTA file to import.

  • force (bool, default: False ) –

    If True, overwrite existing collections/sequences. If False (default), skip duplicates.

  • namespaces (Optional[List[str]], default: None ) –

    Optional list of namespace prefixes to extract aliases from FASTA headers. For example, ["ncbi", "refseq"] will scan headers for tokens like ncbi:NC_000001.11 and register them as aliases.

Returns:

  • tuple[SequenceCollectionMetadata, bool] –

    A tuple containing:

    - SequenceCollectionMetadata: Metadata for the collection.
    - bool: True if the collection was newly added, False if it already existed.

Raises:

  • IOError –

    If the file cannot be read or processed.

Example::

store = RefgetStore.in_memory()
metadata, was_new = store.add_sequence_collection_from_fasta("genome.fa")
print(f"{'Added' if was_new else 'Skipped'}: {metadata.digest}")

# Extract aliases from FASTA headers
metadata, was_new = store.add_sequence_collection_from_fasta(
    "genome.fa", namespaces=["ncbi", "refseq"]
)
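The namespaces parameter is described as scanning headers for tokens such as ncbi:NC_000001.11. The sketch below shows one plausible way such tokens could be pulled out of a header line. `extract_aliases` is a hypothetical helper, and the whitespace tokenization is an assumption; the actual parsing rules in gtars may differ.

```python
from typing import List, Tuple


def extract_aliases(header: str, namespaces: List[str]) -> List[Tuple[str, str]]:
    """Return (namespace, alias) pairs for tokens shaped like `namespace:alias`."""
    wanted = set(namespaces)
    pairs = []
    for token in header.lstrip(">").split():
        namespace, sep, alias = token.partition(":")
        if sep and alias and namespace in wanted:
            pairs.append((namespace, alias))
    return pairs


header = ">chr1 ncbi:NC_000001.11 refseq:NC_000001.11 ucsc:chr1"
print(extract_aliases(header, ["ncbi", "refseq"]))
# [('ncbi', 'NC_000001.11'), ('refseq', 'NC_000001.11')]
```

Tokens whose prefix is not in the requested list (here ucsc:chr1) are ignored.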
add_sequence_collection
add_sequence_collection(collection: SequenceCollection, force: bool = False) -> None

Add a pre-built SequenceCollection to the store.

Adds a SequenceCollection (created via digest_fasta() or programmatically) directly to the store without reading from a FASTA file.

Parameters:

  • collection (SequenceCollection) –

    A SequenceCollection to add.

  • force (bool, default: False ) –

    If True, overwrite existing collections/sequences. If False (default), skip duplicates.

Raises:

  • IOError –

    If the collection cannot be stored.

Example::

from gtars.refget import RefgetStore, digest_fasta
store = RefgetStore.in_memory()
collection = digest_fasta("genome.fa")
store.add_sequence_collection(collection)
add_sequence
add_sequence(sequence: SequenceRecord, force: bool = False) -> None

Add a sequence to the store without collection association.

The sequence can be created using digest_sequence() and later retrieved by its digest via get_sequence().

Parameters:

  • sequence (SequenceRecord) –

    A SequenceRecord created by digest_sequence().

  • force (bool, default: False ) –

    If True, overwrite existing. If False (default), skip duplicates.

Raises:

  • IOError –

    If the sequence cannot be stored.

Example::

from gtars.refget import RefgetStore, digest_sequence
store = RefgetStore.in_memory()
seq = digest_sequence(b"ACGTACGT")
store.add_sequence(seq)
retrieved = store.get_sequence(seq.metadata.sha512t24u)
list_collections
list_collections(page: int = 0, page_size: int = 100, filters: Optional[Dict[str, str]] = None) -> Dict[str, Any]

List collections with pagination and optional attribute filtering.

Parameters:

  • page (int, default: 0 ) –

    0-indexed page number.

  • page_size (int, default: 100 ) –

    Number of results per page.

  • filters (Optional[Dict[str, str]], default: None ) –

    Optional attribute filters (AND logic). Keys are attribute names (names, lengths, sequences, name_length_pairs, sorted_name_length_pairs, sorted_sequences), values are digests.

Returns:

  • Dict[str, Any] –

    Dict with "results" (list of SequenceCollectionMetadata) and "pagination" (dict with page, page_size, total).

Example::

# Get first page of all collections
result = store.list_collections()
for meta in result["results"]:
    print(f"{meta.digest}: {meta.n_sequences} sequences")
print(f"Total: {result['pagination']['total']}")

# Filter by names digest
result = store.list_collections(filters={"names": "abc123"})
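The 0-indexed paging contract can be modeled on a plain list. The sketch below mirrors the documented result shape ("results" plus "pagination" with page, page_size, total) and shows how a caller might walk all pages; `paginate` and `iter_all` are hypothetical helpers operating on placeholder data, not gtars functions.

```python
from typing import Any, Dict, List


def paginate(items: List[Any], page: int = 0, page_size: int = 100) -> Dict[str, Any]:
    """Slice `items` into one 0-indexed page and report pagination details."""
    start = page * page_size
    return {
        "results": items[start:start + page_size],
        "pagination": {"page": page, "page_size": page_size, "total": len(items)},
    }


def iter_all(items: List[Any], page_size: int) -> List[Any]:
    """Walk every page until `total` results have been seen."""
    collected: List[Any] = []
    page = 0
    while True:
        result = paginate(items, page=page, page_size=page_size)
        collected.extend(result["results"])
        if len(collected) >= result["pagination"]["total"] or not result["results"]:
            return collected
        page += 1


digests = [f"digest_{i}" for i in range(7)]  # placeholder values
print(paginate(digests, page=1, page_size=3)["results"])  # ['digest_3', 'digest_4', 'digest_5']
print(len(iter_all(digests, page_size=3)))                # 7
```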
remove_collection
remove_collection(digest: str, remove_orphan_sequences: bool = False) -> bool

Remove a collection from the store.

Parameters:

  • digest (str) –

    The collection's SHA-512/24u digest string.

  • remove_orphan_sequences (bool, default: False ) –

    If True, also remove sequences no longer referenced by any remaining collection. Default: False.

Returns:

  • bool –

    True if the collection was found and removed, False if not found.

get_collection_metadata
get_collection_metadata(collection_digest: str) -> Optional[SequenceCollectionMetadata]

Get metadata for a collection by digest.

Returns lightweight metadata without loading the full collection. Use this for quick lookups of collection information.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

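Returns:

  • Optional[SequenceCollectionMetadata] –

    Collection metadata if the digest is found, None otherwise.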

Example::

meta = store.get_collection_metadata("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
if meta:
    print(f"Collection has {meta.n_sequences} sequences")
get_collection
get_collection(collection_digest: str) -> SequenceCollection

Get a collection by digest with all sequences loaded.

Loads the collection and all its sequence data into memory. Use this when you need full access to sequence content.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

Returns:

  • SequenceCollection –

    The collection with all sequence data loaded.

Raises:

  • IOError –

    If the collection cannot be loaded.

Example::

collection = store.get_collection("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
for seq in collection.sequences:
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
iter_collections
iter_collections() -> List[SequenceCollection]

Iterate over all collections with their sequences loaded.

This loads all collection data upfront and returns a list of SequenceCollection objects with full sequence data.

For browsing without loading data, use list_collections() instead.

Returns:

  • List[SequenceCollection] –

    All collections in the store, with sequence data loaded.

Example::

for coll in store.iter_collections():
    print(f"{coll.digest}: {len(coll.sequences)} sequences")
is_collection_loaded
is_collection_loaded(collection_digest: str) -> bool

Check if a collection is fully loaded.

Returns True if the collection's sequence list is loaded in memory, False if it's only metadata (stub).

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

Returns:

  • bool –

    True if loaded, False otherwise.

list_sequences
list_sequences() -> List[SequenceMetadata]

List all sequence metadata in the store.

Returns metadata for all sequences without loading sequence data. Use this for browsing/inventory operations.

Returns:

  • List[SequenceMetadata] –

    List of metadata for all sequences in the store.

Example::

for meta in store.list_sequences():
    print(f"{meta.name}: {meta.length} bp")
get_sequence_metadata
get_sequence_metadata(seq_digest: str) -> Optional[SequenceMetadata]

Get metadata for a sequence by digest (no data loaded).

Use this for lightweight lookups when you don't need the actual sequence. Automatically strips "SQ." prefix from digest if present.

Parameters:

  • seq_digest (str) –

    The sequence's SHA-512/24u digest, optionally with "SQ." prefix.

Returns:

  • Optional[SequenceMetadata] –

    Sequence metadata if the digest is found, None otherwise.

get_sequence
get_sequence(digest: str) -> SequenceRecord

Retrieve a sequence record by its digest (SHA-512/24u or MD5).

Loads the sequence data if not already in memory. Supports lookup by either SHA-512/24u (preferred) or MD5 digest. Automatically strips "SQ." prefix if present (case-insensitive).

Parameters:

  • digest (str) –

    Sequence digest (SHA-512/24u base64url or MD5 hex string), optionally with "SQ." prefix.

Returns:

  • SequenceRecord –

    The sequence record with its data loaded.

Raises:

  • KeyError –

    If the sequence is not found.

Example::

record = store.get_sequence("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
print(f"Found: {record.metadata.name}")
# Also works with SQ. prefix
record = store.get_sequence("SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
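The prefix handling described above amounts to a case-insensitive check on the first three characters. A minimal sketch follows; `strip_sq_prefix` is a hypothetical helper shown only to illustrate the normalization, and callers do not need to do this themselves since the store is documented to strip the prefix automatically.

```python
def strip_sq_prefix(digest: str) -> str:
    """Drop a leading "SQ." (any letter case) from a digest string."""
    if digest[:3].upper() == "SQ.":
        return digest[3:]
    return digest


print(strip_sq_prefix("SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))  # aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2
print(strip_sq_prefix("sq.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))  # aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2
print(strip_sq_prefix("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))     # aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2
```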
get_sequence_by_name
get_sequence_by_name(collection_digest: str, sequence_name: str) -> SequenceRecord

Retrieve a sequence by collection digest and sequence name.

Looks up a sequence within a specific collection using its name (e.g., "chr1", "chrM"). Loads the sequence data if needed. Automatically strips "SQ." prefix from collection digest if present.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest, optionally with "SQ." prefix.

  • sequence_name (str) –

    Name of the sequence within that collection.

Returns:

  • SequenceRecord –

    The matching sequence record with its data loaded.

Raises:

  • KeyError –

    If the sequence is not found.

Example::

record = store.get_sequence_by_name(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "chr1"
)
print(f"Sequence: {record.decode()[:50]}...")
iter_sequences
iter_sequences() -> List[SequenceRecord]

Iterate over all sequences with their data loaded.

This ensures all sequence data is loaded and returns a list of SequenceRecord objects with full sequence data.

For browsing without loading data, use list_sequences() instead.

Returns:

  • List[SequenceRecord] –

    All sequences in the store, with their data loaded.

Example::

for seq in store.iter_sequences():
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
get_substring
get_substring(seq_digest: str, start: int, end: int) -> str

Extract a substring from a sequence.

Retrieves a specific region from a sequence using 0-based, half-open coordinates [start, end). Automatically loads sequence data if not already cached (for lazy-loaded stores). Automatically strips "SQ." prefix from digest if present.

Parameters:

  • seq_digest (str) –

    Sequence digest (SHA-512/24u), optionally with "SQ." prefix.

  • start (int) –

    Start position (0-based, inclusive).

  • end (int) –

    End position (0-based, exclusive).

Returns:

  • str –

    The substring sequence.

Raises:

  • KeyError –

    If the sequence is not found.

Example::

# Get first 1000 bases of chr1
seq = store.get_substring("chr1_digest", 0, 1000)
print(f"First 50bp: {seq[:50]}")
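The 0-based, half-open convention matches Python slicing, so the result length is always end minus start and adjacent windows tile without overlap. The sketch below illustrates this on a toy string; `substring` is a local stand-in, not the store method.

```python
sequence = "ACGTACGTAC"  # toy sequence standing in for stored data


def substring(seq: str, start: int, end: int) -> str:
    """Return bases in [start, end): `start` is included, `end` is not."""
    return seq[start:end]


first = substring(sequence, 0, 4)
second = substring(sequence, 4, 8)
print(first, second)                    # ACGT ACGT
print(len(substring(sequence, 2, 7)))   # 5
print(first + second == sequence[:8])   # True
```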
stats
stats() -> dict

Returns statistics about the store.

Returns:

  • dict –

    dict with keys:

    - 'n_sequences': Total number of sequences (Stub + Full)
    - 'n_sequences_loaded': Number of sequences with data loaded (Full)
    - 'n_collections': Total number of collections (Stub + Full)
    - 'n_collections_loaded': Number of collections with sequences loaded (Full)
    - 'storage_mode': Storage mode ('Raw' or 'Encoded')

Note

n_collections_loaded only reflects collections fully loaded in memory. For remote stores, collections are loaded on-demand when accessed.

Example::

stats = store.stats()
print(f"Store has {stats['n_sequences']} sequences")
print(f"Collections: {stats['n_collections']} total, {stats['n_collections_loaded']} loaded")
write
write() -> None

Write the store using its configured paths.

Convenience method for disk-backed stores. Uses the store's own local_path and seqdata_path_template.

Raises:

  • IOError –

    If the store cannot be written.

write_store_to_dir
write_store_to_dir(root_path: Union[str, PathLike], seqdata_path_template: Optional[str] = None) -> None

Write the store to a directory on disk.

Persists the store with all sequences and metadata to disk using the RefgetStore directory format.

Parameters:

  • root_path (Union[str, PathLike]) –

    Directory path to write the store to.

  • seqdata_path_template (Optional[str], default: None ) –

    Optional path template for sequence files (e.g., "sequences/%s2/%s.seq" where %s2 = first 2 chars of digest, %s = full digest). Uses default if not specified.

Example::

store.write_store_to_dir("/data/my_store")
store.write_store_to_dir("/data/my_store", "sequences/%s2/%s.seq")
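To show what the template placeholders expand to, here is a small sketch based only on the description above (%s2 is the first 2 characters of the digest, %s the full digest). `expand_seqdata_template` is a hypothetical helper, and the digest is just a sample value; gtars performs this expansion internally.

```python
def expand_seqdata_template(template: str, digest: str) -> str:
    """Fill a template where %s2 is the digest's first 2 chars and %s the full digest."""
    # Replace the longer placeholder first so "%s2" is not consumed as "%s" + "2".
    return template.replace("%s2", digest[:2]).replace("%s", digest)


print(expand_seqdata_template("sequences/%s2/%s.seq", "aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))
# sequences/aK/aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2.seq
```

Sharding by a short digest prefix keeps any single directory from holding every sequence file.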
get_collection_level1
get_collection_level1(digest: str) -> dict

Get level 1 representation (attribute digests) for a collection.

Parameters:

  • digest (str) –

    Collection digest.

Returns:

  • dict –

    dict with spec-compliant field names (names, lengths, sequences, plus optional name_length_pairs, sorted_name_length_pairs, sorted_sequences).

get_collection_level2
get_collection_level2(digest: str) -> dict

Get level 2 representation (full arrays, spec format) for a collection.

Parameters:

  • digest (str) –

    Collection digest.

Returns:

  • dict –

    dict with names (list[str]), lengths (list[int]), sequences (list[str]).
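The relationship between the two levels can be sketched as follows: in the GA4GH seqcol specification, each level 1 digest is obtained by canonicalizing the corresponding level 2 array as JSON (RFC 8785) and hashing it with sha512t24u. The code below is an approximation under stated assumptions: compact `json.dumps` output stands in for full RFC 8785 canonicalization (adequate only for simple ASCII strings and integers), the sequence digests are placeholders, and none of these helpers are gtars functions.

```python
import base64
import hashlib
import json
from typing import Any, Dict, List


def sha512t24u(data: bytes) -> str:
    """GA4GH truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value: Any) -> bytes:
    # Compact JSON; matches RFC 8785 output only for simple ASCII strings and integers.
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


def level2_to_level1(level2: Dict[str, List[Any]]) -> Dict[str, str]:
    """Digest each attribute array to obtain a level 1 style representation."""
    return {attr: sha512t24u(canonical_json(array)) for attr, array in level2.items()}


level2 = {
    "names": ["chr1", "chr2"],
    "lengths": [8, 4],
    "sequences": ["SQ.placeholder_digest_1", "SQ.placeholder_digest_2"],  # placeholders
}
for attr, digest in level2_to_level1(level2).items():
    print(attr, digest)
```

Because each attribute is digested independently, two collections with identical names and lengths but different sequences share their names and lengths digests.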

compare
compare(digest_a: str, digest_b: str) -> dict

Compare two collections by digest.

Parameters:

  • digest_a (str) –

    First collection digest.

  • digest_b (str) –

    Second collection digest.

Returns:

  • dict –

    dict with keys: digests, attributes, array_elements.

find_collections_by_attribute
find_collections_by_attribute(attr_name: str, attr_digest: str) -> List[str]

Find collections by attribute digest.

Parameters:

  • attr_name (str) –

    Attribute name (names, lengths, sequences, name_length_pairs, sorted_name_length_pairs, sorted_sequences).

  • attr_digest (str) –

    The digest to search for.

Returns:

  • List[str] –

    List of collection digests that have the matching attribute.
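Conceptually, this lookup is a scan over each collection's level 1 attribute digests. The sketch below models it with a plain dictionary; the index contents are placeholder values and `find_by_attribute` is a hypothetical function, not the store's implementation.

```python
from typing import Dict, List

# Hypothetical in-memory index: collection digest -> level 1 attribute digests.
collections: Dict[str, Dict[str, str]] = {
    "coll_A": {"names": "n1", "lengths": "l1", "sequences": "s1"},
    "coll_B": {"names": "n1", "lengths": "l2", "sequences": "s2"},
    "coll_C": {"names": "n2", "lengths": "l1", "sequences": "s1"},
}


def find_by_attribute(attr_name: str, attr_digest: str) -> List[str]:
    """Scan every collection and keep those whose attribute digest matches."""
    return [
        digest
        for digest, attrs in collections.items()
        if attrs.get(attr_name) == attr_digest
    ]


print(find_by_attribute("names", "n1"))      # ['coll_A', 'coll_B']
print(find_by_attribute("sequences", "s1"))  # ['coll_A', 'coll_C']
```

A query like the second one finds collections that contain the same sequences under different names.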

get_attribute
get_attribute(attr_name: str, attr_digest: str) -> Optional[list]

Get attribute array by digest.

Parameters:

  • attr_name (str) –

    Attribute name (names, lengths, or sequences).

  • attr_digest (str) –

    The digest to search for.

Returns:

  • Optional[list] –

    The attribute array, or None if not found.

enable_ancillary_digests
enable_ancillary_digests() -> None

Enable computation of ancillary digests.

disable_ancillary_digests
disable_ancillary_digests() -> None

Disable computation of ancillary digests.

has_ancillary_digests
has_ancillary_digests() -> bool

Returns whether ancillary digests are enabled.

has_attribute_index
has_attribute_index() -> bool

Returns whether the on-disk attribute index is enabled.

enable_attribute_index
enable_attribute_index() -> None

Enable indexed attribute lookup (not yet implemented).

disable_attribute_index
disable_attribute_index() -> None

Disable indexed attribute lookup, using brute-force scan instead.

export_fasta_from_regions
export_fasta_from_regions(collection_digest: str, bed_file_path: Union[str, PathLike], output_file_path: Union[str, PathLike]) -> None

Export sequences from BED file regions to a FASTA file.

Reads a BED file defining genomic regions and exports the sequences for those regions to a FASTA file.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

  • bed_file_path (Union[str, PathLike]) –

    Path to BED file defining regions.

  • output_file_path (Union[str, PathLike]) –

    Path to write the output FASTA file.

Raises:

  • IOError –

    If files cannot be read/written or sequences not found.

Example::

store.export_fasta_from_regions(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed",
    "output.fa"
)
substrings_from_regions
substrings_from_regions(collection_digest: str, bed_file_path: Union[str, PathLike]) -> List[RetrievedSequence]

Get substrings for BED file regions as a list.

Reads a BED file and returns a list of sequences for each region.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

  • bed_file_path (Union[str, PathLike]) –

    Path to BED file defining regions.

Returns:

  • List[RetrievedSequence] –

    One retrieved sequence for each region in the BED file.

Raises:

  • IOError –

    If the files cannot be read, or if a sequence is not found.

Example::

sequences = store.substrings_from_regions(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed"
)
for seq in sequences:
    print(f"{seq.chrom_name}:{seq.start}-{seq.end}")
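The region lookup described above can be sketched in plain Python. This is only an illustration of BED semantics (0-based, half-open intervals), not the gtars implementation: the `Region` type, the `substrings_from_bed_text` helper, and the in-memory `sequences` mapping are hypothetical stand-ins for `RetrievedSequence` and the store.

```python
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    """Hypothetical stand-in for a retrieved region (not the gtars type)."""
    chrom_name: str
    start: int
    end: int
    sequence: str


def substrings_from_bed_text(sequences: Dict[str, str], bed_text: str) -> List[Region]:
    """Slice each BED region out of an in-memory name -> sequence mapping."""
    regions = []
    for line in bed_text.splitlines():
        line = line.strip()
        # Skip blank lines and the header/comment lines BED files may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        # BED intervals are 0-based and half-open, which matches Python slicing.
        regions.append(Region(chrom, start, end, sequences[chrom][start:end]))
    return regions


sequences = {"chr1": "ACGTACGTAC", "chr2": "GGGGCCCC"}
bed_text = "chr1\t0\t4\nchr1\t6\t10\nchr2\t2\t6\n"
for region in substrings_from_bed_text(sequences, bed_text):
    print(f"{region.chrom_name}:{region.start}-{region.end} {region.sequence}")
```

Each returned region covers `end - start` bases, so a BED line `chr1 0 4` yields the first four bases of `chr1`.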
export_fasta
export_fasta(collection_digest: str, output_path: Union[str, PathLike], sequence_names: Optional[List[str]] = None, line_width: Optional[int] = None) -> None

Export sequences from a collection to a FASTA file.

Parameters:

  • collection_digest (str) –

    Collection to export from.

  • output_path (Union[str, PathLike]) –

    Path to write FASTA file.

  • sequence_names (Optional[List[str]], default: None ) –

    Optional list of sequence names to export. If None, exports all sequences in the collection.

  • line_width (Optional[int], default: None ) –

    Optional line width for wrapping sequences. If None, uses default of 80.
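As a rough illustration of what `line_width` controls, the sketch below wraps one sequence into fixed-width FASTA lines. The `write_fasta_record` helper is hypothetical and written with the standard library only. It assumes the stated default of 80 and makes no claim about how gtars writes files internally.

```python
import io
from typing import Optional, TextIO


def write_fasta_record(out: TextIO, name: str, sequence: str, line_width: Optional[int] = None) -> None:
    """Write one FASTA record, wrapping the sequence at line_width characters."""
    width = 80 if line_width is None else line_width  # default stated in the docs
    out.write(f">{name}\n")
    for i in range(0, len(sequence), width):
        out.write(sequence[i:i + width] + "\n")


buffer = io.StringIO()
write_fasta_record(buffer, "chr1", "ACGT" * 50)  # 200 bases
print(buffer.getvalue())
```

With 200 bases and the default width, the record has two full lines of 80 bases and a final line of 40.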

export_fasta_by_digests
export_fasta_by_digests(seq_digests: List[str], output_path: Union[str, PathLike], line_width: Optional[int] = None) -> None

Export sequences by their digests to a FASTA file.

Parameters:

  • seq_digests (List[str]) –

    List of sequence digests to export.

  • output_path (Union[str, PathLike]) –

    Path to write FASTA file.

  • line_width (Optional[int], default: None ) –

    Optional line width for wrapping sequences. If None, uses default of 80.

add_sequence_alias
add_sequence_alias(namespace: str, alias: str, digest: str) -> None

Add a sequence alias: namespace/alias maps to sequence digest.

get_sequence_metadata_by_alias
get_sequence_metadata_by_alias(namespace: str, alias: str) -> Optional[SequenceMetadata]

Resolve a sequence alias to sequence metadata (no data loading).

get_sequence_by_alias
get_sequence_by_alias(namespace: str, alias: str) -> Optional[SequenceRecord]

Resolve a sequence alias and return the loaded sequence record.

Returns None if the alias is not found.

get_aliases_for_sequence
get_aliases_for_sequence(digest: str) -> list[tuple[str, str]]

Reverse lookup: find all (namespace, alias) pairs pointing to this sequence digest.

list_sequence_alias_namespaces
list_sequence_alias_namespaces() -> list[str]

List all sequence alias namespaces.

list_sequence_aliases
list_sequence_aliases(namespace: str) -> Optional[list[str]]

List all aliases in a sequence alias namespace.

remove_sequence_alias
remove_sequence_alias(namespace: str, alias: str) -> bool

Remove a single sequence alias. Returns True if it existed.

load_sequence_aliases
load_sequence_aliases(namespace: str, path: str) -> int

Load sequence aliases from a TSV file (alias\tdigest per line).
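The alias model (a namespace plus an alias mapping to a digest, with reverse lookup) and the TSV layout can be sketched with plain dictionaries. The helpers `parse_alias_tsv` and `aliases_for_digest` are hypothetical, the digests are placeholders, and skipping blank lines is an assumption about the file format rather than documented behavior.

```python
from typing import Dict, List, Tuple


def parse_alias_tsv(text: str) -> Dict[str, str]:
    """Parse 'alias<TAB>digest' lines into an alias -> digest mapping."""
    aliases = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines (an assumption about the format)
        alias, digest = line.split("\t")[:2]
        aliases[alias] = digest
    return aliases


def aliases_for_digest(store: Dict[str, Dict[str, str]], digest: str) -> List[Tuple[str, str]]:
    """Reverse lookup: every (namespace, alias) pair that maps to digest."""
    return [
        (namespace, alias)
        for namespace, aliases in store.items()
        for alias, target in aliases.items()
        if target == digest
    ]


# Placeholder digests, for illustration only.
store = {
    "ucsc": parse_alias_tsv("chr1\tdigestA\nchr2\tdigestB\n"),
    "ncbi": parse_alias_tsv("NC_000001.11\tdigestA\n"),
}
print(len(store["ucsc"]))                    # number of aliases parsed
print(aliases_for_digest(store, "digestA"))  # pairs from both namespaces
```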

add_collection_alias
add_collection_alias(namespace: str, alias: str, digest: str) -> None

Add a collection alias: namespace/alias maps to collection digest.

get_collection_metadata_by_alias
get_collection_metadata_by_alias(namespace: str, alias: str) -> Optional[SequenceCollectionMetadata]

Resolve a collection alias to collection metadata (no data loading).

get_collection_by_alias
get_collection_by_alias(namespace: str, alias: str) -> Optional[SequenceCollection]

Resolve a collection alias and return the loaded collection.

Returns None if the alias is not found.

get_aliases_for_collection
get_aliases_for_collection(digest: str) -> list[tuple[str, str]]

Reverse lookup: find all (namespace, alias) pairs pointing to this collection digest.

list_collection_alias_namespaces
list_collection_alias_namespaces() -> list[str]

List all collection alias namespaces.

list_collection_aliases
list_collection_aliases(namespace: str) -> Optional[list[str]]

List all aliases in a collection alias namespace.

remove_collection_alias
remove_collection_alias(namespace: str, alias: str) -> bool

Remove a single collection alias. Returns True if it existed.

load_collection_aliases
load_collection_aliases(namespace: str, path: str) -> int

Load collection aliases from a TSV file (alias\tdigest per line).

set_fhr_metadata
set_fhr_metadata(collection_digest: str, metadata: FhrMetadata) -> None

Set FHR metadata for a collection.

get_fhr_metadata
get_fhr_metadata(collection_digest: str) -> Optional[FhrMetadata]

Get FHR metadata for a collection. Returns None if missing.

remove_fhr_metadata
remove_fhr_metadata(collection_digest: str) -> bool

Remove FHR metadata for a collection.

list_fhr_metadata
list_fhr_metadata() -> list[str]

List all collection digests that have FHR metadata.

load_fhr_metadata
load_fhr_metadata(collection_digest: str, path: str) -> None

Load FHR metadata from a JSON file and attach it to a collection.

into_readonly
into_readonly() -> ReadonlyRefgetStore

Convert to a ReadonlyRefgetStore for concurrent read access.

Consumes this store (replacing it with an empty in-memory store) and returns a ReadonlyRefgetStore whose read methods all use &self (no mutable borrow), making it suitable for Arc<ReadonlyRefgetStore> in servers.

Call load_all_collections() or load_collection() before converting, since ReadonlyRefgetStore cannot lazy-load.

Returns:

  • ReadonlyRefgetStore –

    An immutable store suitable for concurrent access.

Example::

store = RefgetStore.open_remote("/cache", "https://example.com")
store.load_all_collections()
readonly = store.into_readonly()
coll = readonly.get_collection("abc123")
__len__
__len__() -> int
__iter__
__iter__() -> Iterator[SequenceMetadata]
__str__
__str__() -> str
__repr__
__repr__() -> str

ReadonlyRefgetStore

An immutable RefgetStore for concurrent read access.

All read methods use immutable references, making this suitable for concurrent access patterns (e.g., shared across threads in a server).

This type has no write methods and no constructors; it can only be obtained via RefgetStore.into_readonly().

Read methods that require preloaded data (e.g., get_collection()) will error if the data was not loaded before conversion.

Attributes:

  • cache_path (Optional[str]) –

    Local directory path where the store is located or cached. None for in-memory stores.

  • remote_url (Optional[str]) –

    Remote URL of the store if loaded remotely, None otherwise.

  • storage_mode (StorageMode) –

    Current storage mode (Raw or Encoded).

Example::

store = RefgetStore.open_remote("/cache", "https://example.com")
store.load_all_collections()
readonly = store.into_readonly()
coll = readonly.get_collection("abc123")
Attributes
cache_path instance-attribute
cache_path: Optional[str]
remote_url instance-attribute
remote_url: Optional[str]
storage_mode property
storage_mode: StorageMode

Current storage mode (Raw or Encoded).

Functions
list_collections
list_collections(page: int = 0, page_size: int = 100, filters: Optional[Dict[str, str]] = None) -> Dict[str, Any]

List collections with pagination and optional attribute filtering.

get_collection_metadata
get_collection_metadata(collection_digest: str) -> Optional[SequenceCollectionMetadata]

Get metadata for a collection by digest.

get_collection
get_collection(collection_digest: str) -> SequenceCollection

Get a collection by digest with all sequences loaded.

Requires that the collection was preloaded before conversion.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

Returns:

  • SequenceCollection –

    The collection with all of its sequences loaded.

Raises:

  • IOError –

    If the collection was not preloaded.

is_collection_loaded
is_collection_loaded(collection_digest: str) -> bool

Check if a collection is fully loaded.

get_collection_level1
get_collection_level1(digest: str) -> dict

Get level 1 representation (attribute digests) for a collection.

get_collection_level2
get_collection_level2(digest: str) -> dict

Get level 2 representation (full arrays) for a collection.

compare
compare(digest_a: str, digest_b: str) -> dict

Compare two collections by digest.

find_collections_by_attribute
find_collections_by_attribute(attr_name: str, attr_digest: str) -> List[str]

Find collections by attribute digest.

get_attribute
get_attribute(attr_name: str, attr_digest: str) -> Optional[list]

Get attribute array by digest.

has_ancillary_digests
has_ancillary_digests() -> bool

Returns whether ancillary digests are enabled.

has_attribute_index
has_attribute_index() -> bool

Returns whether the on-disk attribute index is enabled.

list_sequences
list_sequences() -> List[SequenceMetadata]

List all sequence metadata in the store.

get_sequence_metadata
get_sequence_metadata(seq_digest: str) -> Optional[SequenceMetadata]

Get metadata for a sequence by digest.

get_sequence
get_sequence(digest: str) -> SequenceRecord

Retrieve a sequence record by its digest.

Parameters:

  • digest (str) –

    Sequence digest (SHA-512/24u or MD5).

Returns:

  • SequenceRecord –

    The sequence record for the given digest.

Raises:

  • KeyError –

    If the sequence is not found.

get_sequence_by_name
get_sequence_by_name(collection_digest: str, sequence_name: str) -> SequenceRecord

Retrieve a sequence by collection digest and sequence name.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

  • sequence_name (str) –

    Name of the sequence within that collection.

Returns:

  • SequenceRecord –

    The sequence record with the given name in the collection.

Raises:

  • KeyError –

    If the sequence is not found.

get_substring
get_substring(seq_digest: str, start: int, end: int) -> str

Extract a substring from a sequence.

Parameters:

  • seq_digest (str) –

    Sequence digest (SHA-512/24u).

  • start (int) –

    Start position (0-based, inclusive).

  • end (int) –

    End position (0-based, exclusive).

Returns:

  • str –

    The substring sequence.

Raises:

  • KeyError –

    If the sequence is not found.
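The coordinate convention (0-based, inclusive start, exclusive end) is the same one Python slices use. A minimal sketch on an in-memory string follows. The `substring` helper is hypothetical, and raising `ValueError` for an out-of-range request is an assumption, since the reference above only documents `KeyError` for a missing sequence.

```python
def substring(sequence: str, start: int, end: int) -> str:
    """Return sequence[start:end] using 0-based, half-open coordinates."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError(f"invalid range {start}-{end} for length {len(sequence)}")
    return sequence[start:end]


sequence = "ACGTACGT"
print(substring(sequence, 0, 4))  # first four bases
print(substring(sequence, 2, 5))  # bases at indices 2, 3 and 4
```

The result always has `end - start` characters, and `start == end` yields an empty string.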

stats
stats() -> dict

Returns statistics about the store.

get_sequence_metadata_by_alias
get_sequence_metadata_by_alias(namespace: str, alias: str) -> Optional[SequenceMetadata]

Resolve a sequence alias to sequence metadata.

get_sequence_by_alias
get_sequence_by_alias(namespace: str, alias: str) -> Optional[SequenceRecord]

Resolve a sequence alias and return the loaded sequence record.

get_aliases_for_sequence
get_aliases_for_sequence(digest: str) -> list[tuple[str, str]]

Reverse lookup: find all (namespace, alias) pairs for this sequence.

list_sequence_alias_namespaces
list_sequence_alias_namespaces() -> list[str]

List all sequence alias namespaces.

list_sequence_aliases
list_sequence_aliases(namespace: str) -> Optional[list[str]]

List all aliases in a sequence alias namespace.

get_collection_metadata_by_alias
get_collection_metadata_by_alias(namespace: str, alias: str) -> Optional[SequenceCollectionMetadata]

Resolve a collection alias to collection metadata.

get_collection_by_alias
get_collection_by_alias(namespace: str, alias: str) -> Optional[SequenceCollection]

Resolve a collection alias and return the loaded collection.

get_aliases_for_collection
get_aliases_for_collection(digest: str) -> list[tuple[str, str]]

Reverse lookup: find all (namespace, alias) pairs for this collection.

list_collection_alias_namespaces
list_collection_alias_namespaces() -> list[str]

List all collection alias namespaces.

list_collection_aliases
list_collection_aliases(namespace: str) -> Optional[list[str]]

List all aliases in a collection alias namespace.

get_fhr_metadata
get_fhr_metadata(collection_digest: str) -> Optional[FhrMetadata]

Get FHR metadata for a collection.

list_fhr_metadata
list_fhr_metadata() -> list[str]

List all collection digests that have FHR metadata.

__len__
__len__() -> int
__str__
__str__() -> str
__repr__
__repr__() -> str

Functions

sha512t24u_digest

sha512t24u_digest(readable: Union[str, bytes]) -> str

Compute the GA4GH SHA-512/24u digest for a sequence.

This function computes the GA4GH refget standard digest (truncated SHA-512, base64url encoded) for a given sequence string or bytes.

Parameters:

  • readable (Union[str, bytes]) –

    Input sequence as str or bytes.

Returns:

  • str –

    The SHA-512/24u digest (32 character base64url string).

Raises:

  • TypeError –

    If input is not str or bytes.

Example::

from gtars.refget import sha512t24u_digest
digest = sha512t24u_digest("ACGT")
print(digest)
# Output: aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2
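For readers who want to see what "truncated SHA-512, base64url encoded" means, the digest can be reproduced with the standard library alone. This is a sketch of the algorithm, not the Rust implementation behind `sha512t24u_digest`. The `sha512t24u` function name is hypothetical, and encoding `str` input as ASCII is an assumption.

```python
import base64
import hashlib
from typing import Union


def sha512t24u(data: Union[str, bytes]) -> str:
    """Truncated SHA-512 (first 24 bytes), base64url-encoded."""
    if isinstance(data, str):
        data = data.encode("ascii")
    elif not isinstance(data, bytes):
        raise TypeError("input must be str or bytes")
    digest = hashlib.sha512(data).digest()[:24]
    # 24 bytes encode to exactly 32 base64url characters with no padding.
    return base64.urlsafe_b64encode(digest).decode("ascii")


print(sha512t24u("ACGT"))
```

Because 24 bytes divide evenly into base64 groups of three, the result never carries `=` padding, which is why the digest is always 32 characters long.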

md5_digest

md5_digest(readable: Union[str, bytes]) -> str

Compute the MD5 digest for a sequence.

This function computes the MD5 hash for a given sequence string or bytes. MD5 is supported for backward compatibility with legacy systems.

Parameters:

  • readable (Union[str, bytes]) –

    Input sequence as str or bytes.

Returns:

  • str –

    The MD5 digest (32 character hexadecimal string).

Raises:

  • TypeError –

    If input is not str or bytes.

Example::

from gtars.refget import md5_digest
digest = md5_digest("ACGT")
print(digest)
# Output: f1f8f4bf413b16ad135722aa4591043e

digest_fasta

digest_fasta(fasta: Union[str, PathLike]) -> SequenceCollection

Digest all sequences in a FASTA file and compute collection-level digests.

This function reads a FASTA file and computes GA4GH-compliant digests for each sequence, as well as collection-level digests (Level 1 and Level 2) following the GA4GH refget specification.

Parameters:

  • fasta (Union[str, PathLike]) –

    Path to FASTA file (str or PathLike).

Returns:

  • SequenceCollection –

    Collection containing all sequences with their metadata and computed digests.

Raises:

  • IOError –

    If the FASTA file cannot be read or parsed.

Example::

from gtars.refget import digest_fasta
collection = digest_fasta("genome.fa")
print(f"Collection digest: {collection.digest}")
print(f"Number of sequences: {len(collection)}")

compute_fai

compute_fai(fasta: Union[str, PathLike]) -> List[FaiRecord]

Compute FASTA index (FAI) metadata for all sequences in a FASTA file.

This function computes the FAI index metadata (offset, line_bases, line_bytes) for each sequence in a FASTA file, compatible with samtools faidx format. Only works with uncompressed FASTA files.

Parameters:

  • fasta (Union[str, PathLike]) –

    Path to FASTA file (str or PathLike). Must be uncompressed.

Returns:

  • List[FaiRecord] –

    List of FAI records, one per sequence, containing name, length, and FAI metadata (offset, line_bases, line_bytes).

Raises:

  • IOError –

    If the FASTA file cannot be read or is compressed.

Example::

from gtars.refget import compute_fai
fai_records = compute_fai("genome.fa")
for record in fai_records:
    print(f"{record.name}: {record.length} bp")
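To make the three FAI fields concrete, here is a standard-library sketch that derives them from uncompressed FASTA bytes, following the samtools faidx column definitions. The `Fai` type and `fai_from_bytes` helper are hypothetical and are not the types returned by `compute_fai`. The sketch assumes well-formed input in which every sequence line of a record, except possibly the last, has the same length.

```python
from typing import List, NamedTuple


class Fai(NamedTuple):
    """Hypothetical record mirroring the five .fai columns."""
    name: str
    length: int
    offset: int
    line_bases: int
    line_bytes: int


def fai_from_bytes(fasta: bytes) -> List[Fai]:
    """Compute .fai-style fields for uncompressed FASTA content."""
    records: List[Fai] = []
    name, length, offset, line_bases, line_bytes = None, 0, 0, 0, 0
    position = 0  # byte offset of the current line
    for raw in fasta.splitlines(keepends=True):
        line = raw.rstrip(b"\r\n")
        if line.startswith(b">"):
            if name is not None:
                records.append(Fai(name, length, offset, line_bases, line_bytes))
            # The name is the header text up to the first whitespace.
            name = line[1:].split()[0].decode()
            length, line_bases, line_bytes = 0, 0, 0
            offset = position + len(raw)  # first base follows the header line
        elif name is not None and line:
            if line_bases == 0:
                # Line geometry is taken from the first sequence line.
                line_bases, line_bytes = len(line), len(raw)
            length += len(line)
        position += len(raw)
    if name is not None:
        records.append(Fai(name, length, offset, line_bases, line_bytes))
    return records


for record in fai_from_bytes(b">chr1 test\nACGTAC\nGT\n>chr2\nGGCC\n"):
    print(record)
```

With these fields, the byte position of base `i` in a record is `offset + (i // line_bases) * line_bytes + i % line_bases`, which is what makes random access possible without scanning the file.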

load_fasta

load_fasta(fasta: Union[str, PathLike]) -> SequenceCollection

Load a FASTA file with sequence data into a SequenceCollection.

This function reads a FASTA file and loads all sequences with their data into memory. Unlike digest_fasta(), this includes the actual sequence data, not just metadata.

Parameters:

  • fasta (Union[str, PathLike]) –

    Path to FASTA file (str or PathLike).

Returns:

  • SequenceCollection –

    Collection containing all sequences with their metadata and sequence data loaded.

Raises:

  • IOError –

    If the FASTA file cannot be read or parsed.

Example::

from gtars.refget import load_fasta
collection = load_fasta("genome.fa")
first_seq = collection[0]
print(f"Sequence: {first_seq.decode()[:50]}...")

digest_sequence

digest_sequence(data: bytes, name: Optional[str] = None, description: Optional[str] = None) -> SequenceRecord

Create a SequenceRecord from raw data, computing all metadata.

This is the sequence-level parallel to digest_fasta() for collections. It computes the GA4GH sha512t24u digest, MD5 digest, detects the alphabet, and returns a SequenceRecord with computed metadata and the original data.

The input data is automatically uppercased to ensure consistent digest computation (matching FASTA processing behavior).

Parameters:

  • data (bytes) –

    The raw sequence bytes (e.g., b"ACGTACGT").

  • name (Optional[str], default: None ) –

    Optional sequence name (e.g., "chr1"). Defaults to "" if not provided.

  • description (Optional[str], default: None ) –

    Optional description text for the sequence.

Returns:

  • SequenceRecord –

    A SequenceRecord with computed metadata and the original data (uppercased).

Example::

from gtars.refget import digest_sequence
seq = digest_sequence(b"ACGTACGT")
print(seq.metadata.length)
# Output: 8

seq = digest_sequence(b"ACGT", name="chr1")
print(seq.metadata.name, seq.metadata.length)
# Output: chr1 4

# With description
seq2 = digest_sequence(b"ACGT", name="chr1", description="Chromosome 1")
print(seq2.metadata.description)
# Output: Chromosome 1
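The effect of the automatic uppercasing can be shown with a standard-library sketch: lowercase and uppercase input produce identical digests once normalized. The `describe_sequence` helper and its dictionary layout are hypothetical and do not reflect the `SequenceRecord` structure. Alphabet detection is omitted.

```python
import base64
import hashlib


def describe_sequence(data: bytes, name: str = "") -> dict:
    """Uppercase the input, then compute length and both digests."""
    normalized = data.upper()
    sha512t24u = base64.urlsafe_b64encode(hashlib.sha512(normalized).digest()[:24]).decode("ascii")
    return {
        "name": name,
        "length": len(normalized),
        "sha512t24u": sha512t24u,
        "md5": hashlib.md5(normalized).hexdigest(),
        "data": normalized,
    }


lower = describe_sequence(b"acgtacgt", name="chr1")
upper = describe_sequence(b"ACGTACGT", name="chr1")
print(lower["length"])
print(lower["sha512t24u"] == upper["sha512t24u"])
```

Normalizing before hashing means soft-masked (lowercase) regions in a FASTA file do not change a sequence's identity.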