Refget Module API Reference

refget

Type stubs and documentation for the gtars.refget module.

This file serves two purposes:

  1. Type Hints: Provides type annotations for IDE autocomplete and static type checking tools like mypy.

  2. Documentation: Contains Google-style docstrings that mkdocstrings uses to generate the API reference documentation website.

Note: The actual implementation is in Rust (gtars-python/src/refget/mod.rs) and compiled via PyO3. This stub file provides the Python interface definition and structured documentation that tools can parse properly.

Classes

AlphabetType

Bases: Enum

Represents the type of alphabet for a sequence.

Attributes
Dna2bit instance-attribute
Dna2bit: int
Dna3bit instance-attribute
Dna3bit: int
DnaIupac instance-attribute
DnaIupac: int
Protein instance-attribute
Protein: int
Ascii instance-attribute
Ascii: int
Unknown instance-attribute
Unknown: int
Functions
__str__
__str__() -> str

SequenceMetadata

Metadata for a biological sequence.

Attributes
name instance-attribute
name: str
length instance-attribute
length: int
sha512t24u instance-attribute
sha512t24u: str
md5 instance-attribute
md5: str
alphabet instance-attribute
alphabet: AlphabetType
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

SequenceRecord

A record representing a biological sequence, including its metadata and optional data.

Attributes
metadata instance-attribute
metadata: SequenceMetadata
sequence instance-attribute
sequence: Optional[bytes]
Functions
decode
decode() -> Optional[str]

Decode and return the sequence data as a string.

For Full records with sequence data, returns the decoded sequence. For Stub records without sequence data, returns None.

Returns:

  • Optional[str] –

    Decoded sequence string if data is available, None otherwise.
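Because decode() can return None for Stub records, callers may want to guard before slicing the result. The following is a minimal sketch of that pattern using a hypothetical stand-in class (SketchRecord and preview are illustrative names, not part of gtars), assuming the data is plain ASCII bytes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchRecord:
    """Hypothetical stand-in for a record that may or may not carry data."""

    name: str
    sequence: Optional[bytes] = None  # None models a Stub record

    def decode(self) -> Optional[str]:
        # Mirror the documented contract: a string for Full, None for Stub.
        if self.sequence is None:
            return None
        return self.sequence.decode("ascii")


def preview(record: SketchRecord, n: int = 20) -> str:
    """Return a short preview without assuming that data is loaded."""
    text = record.decode()
    if text is None:
        return f"{record.name}: <no data loaded>"
    return f"{record.name}: {text[:n]}"
```

The same None check applies to any code that slices the result of decode() on a record that might be a Stub.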

__repr__
__repr__() -> str
__str__
__str__() -> str

SeqColDigestLvl1

Level 1 digests for a sequence collection.

Attributes
sequences_digest instance-attribute
sequences_digest: str
names_digest instance-attribute
names_digest: str
lengths_digest instance-attribute
lengths_digest: str
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

SequenceCollectionMetadata

Metadata for a sequence collection.

Contains the collection digest and level 1 digests for names, sequences, and lengths. This is a lightweight representation of a collection without the actual sequence list.

Attributes:

  • digest (str) –

    The collection's SHA-512/24u digest.

  • n_sequences (int) –

    Number of sequences in the collection.

  • names_digest (str) –

    Level 1 digest of the names array.

  • sequences_digest (str) –

    Level 1 digest of the sequences array.

  • lengths_digest (str) –

    Level 1 digest of the lengths array.
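As a rough illustration of what a level 1 digest is, the sketch below digests the compact JSON form of an array, which is the general approach described by the GA4GH sequence collections specification. It is an assumption that gtars follows exactly this canonicalization; the helper names sha512t24u and level1_digest are hypothetical, and compact json.dumps output only matches canonical JSON for simple arrays of ASCII strings and integers.

```python
import base64
import hashlib
import json
from typing import List, Union


def sha512t24u(data: bytes) -> str:
    """Truncated SHA-512 (first 24 bytes), base64url encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def level1_digest(values: List[Union[str, int]]) -> str:
    """Digest the compact JSON serialization of one attribute array."""
    canonical = json.dumps(values, separators=(",", ":"), ensure_ascii=False)
    return sha512t24u(canonical.encode("utf-8"))


names_digest = level1_digest(["chr1", "chr2"])
lengths_digest = level1_digest([8, 4])
```

Array order matters here: reordering the names produces a different digest.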

Attributes
digest instance-attribute
digest: str
n_sequences instance-attribute
n_sequences: int
names_digest instance-attribute
names_digest: str
sequences_digest instance-attribute
sequences_digest: str
lengths_digest instance-attribute
lengths_digest: str
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

SequenceCollection

A collection of biological sequences.

Attributes
sequences instance-attribute
sequences: List[SequenceRecord]
digest instance-attribute
digest: str
lvl1 instance-attribute
lvl1: SeqColDigestLvl1
file_path instance-attribute
file_path: Optional[str]
has_data instance-attribute
has_data: bool
Functions
__repr__
__repr__() -> str
__str__
__str__() -> str

SequenceCollectionRecord

A record representing a sequence collection, which may be a Stub or Full.

Stub records contain only metadata (digest, n_sequences, level 1 digests). Full records contain metadata plus the list of SequenceRecord objects.

Attributes
metadata instance-attribute
metadata: SequenceCollectionMetadata
sequences property
sequences: Optional[List[SequenceRecord]]

Get the sequences if loaded (Full), None if stub-only.

Functions
has_sequences
has_sequences() -> bool

Check if this record has sequences loaded (is Full, not Stub).

__repr__
__repr__() -> str
__str__
__str__() -> str

RetrievedSequence

Represents a retrieved sequence segment with its metadata. Exposed from the Rust PyRetrievedSequence struct.

Attributes
sequence instance-attribute
sequence: str
chrom_name instance-attribute
chrom_name: str
start instance-attribute
start: int
end instance-attribute
end: int
Functions
__init__
__init__(sequence: str, chrom_name: str, start: int, end: int) -> None
__repr__
__repr__() -> str
__str__
__str__() -> str

StorageMode

Bases: Enum

Defines how sequence data is stored in the Refget store.

Attributes
Raw instance-attribute
Raw: int
Encoded instance-attribute
Encoded: int

RefgetStore

A global store for GA4GH refget sequences with lazy-loading support.

RefgetStore provides content-addressable storage for reference genome sequences following the GA4GH refget specification. Supports both local and remote stores with on-demand sequence loading.

Attributes:

  • cache_path (Optional[str]) –

    Local directory path where the store is located or cached. None for in-memory stores.

  • remote_url (Optional[str]) –

    Remote URL of the store if loaded remotely, None otherwise.

Note

Boolean evaluation: RefgetStore follows Python container semantics, meaning bool(store) is False for empty stores (like list, dict, etc.). To check if a store variable is initialized (not None), use if store is not None: rather than if store:.

Example::

store = RefgetStore.in_memory()  # Empty store
bool(store)  # False (empty container)
len(store)   # 0

# Wrong: checks emptiness, not initialization
if store:
    process(store)

# Right: checks if variable is set
if store is not None:
    process(store)
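The container semantics described in the note can be reproduced with any class that defines __len__. The sketch below uses a hypothetical TinyStore class (not the gtars class) to show why an empty store is falsy and how to tell "uninitialized" apart from "empty":

```python
from typing import Dict, Optional


class TinyStore:
    """Hypothetical stand-in for a container-like store."""

    def __init__(self) -> None:
        self._sequences: Dict[str, str] = {}

    def add(self, digest: str, sequence: str) -> None:
        self._sequences[digest] = sequence

    def __len__(self) -> int:
        # With no __bool__ defined, Python derives truthiness from __len__.
        return len(self._sequences)


def describe(store: Optional[TinyStore]) -> str:
    if store is None:
        return "uninitialized"
    if not store:
        return "initialized but empty"
    return f"initialized with {len(store)} sequence(s)"
```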

Examples:

Create a new store and import sequences::

from gtars.refget import RefgetStore, StorageMode
store = RefgetStore(StorageMode.Encoded)
store.import_fasta("genome.fa")

Open an existing local store::

store = RefgetStore.open_local("/data/hg38")
seq = store.get_substring("chr1_digest", 0, 1000)

Open a remote store with caching::

store = RefgetStore.open_remote(
    "/local/cache",
    "https://example.com/hg38"
)
Attributes
cache_path instance-attribute
cache_path: Optional[str]
remote_url instance-attribute
remote_url: Optional[str]
Functions
__init__
__init__(mode: StorageMode) -> None

Create a new empty RefgetStore.

Parameters:

  • mode (StorageMode) –

    Storage mode: either StorageMode.Raw (uncompressed) or StorageMode.Encoded (bit-packed, space-efficient).

Example::

store = RefgetStore(StorageMode.Encoded)
in_memory classmethod
in_memory() -> RefgetStore

Create a new in-memory RefgetStore.

Creates a store that keeps all sequences in memory. Use this for temporary processing or when you don't need disk persistence.

Returns:

  • RefgetStore –

    New empty RefgetStore with Encoded storage mode.

Example::

store = RefgetStore.in_memory()
store.import_fasta("genome.fa")
on_disk classmethod
on_disk(cache_path: Union[str, PathLike]) -> RefgetStore

Create or load a disk-backed RefgetStore.

If the directory contains an existing store (rgstore.json), loads it. Otherwise creates a new store with Encoded mode.

Parameters:

  • cache_path (Union[str, PathLike]) –

    Directory path for the store. Created if it doesn't exist.

Returns:

  • RefgetStore –

    RefgetStore (new or loaded from disk).

Example::

store = RefgetStore.on_disk("/data/my_store")
store.import_fasta("genome.fa")
# Store is automatically persisted to disk
open_local classmethod
open_local(path: Union[str, PathLike]) -> RefgetStore

Open a local RefgetStore from a directory.

Loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly accessed with get_collection()/get_sequence().

Expects: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi

Parameters:

  • path (Union[str, PathLike]) –

    Local directory containing the refget store.

Returns:

  • RefgetStore –

    RefgetStore with metadata loaded, sequences lazy-loaded.

Raises:

  • IOError –

    If the store directory or index files cannot be read.

Example::

store = RefgetStore.open_local("/data/hg38_store")
seq = store.get_substring("chr1_digest", 0, 1000)
open_remote classmethod
open_remote(cache_path: Union[str, PathLike], remote_url: str) -> RefgetStore

Open a remote RefgetStore with local caching.

Loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when get_collection()/get_sequence() is called.

By default, persistence is enabled (sequences are cached to disk). Call disable_persistence() after loading to keep newly fetched sequences in memory only.

Parameters:

  • cache_path (Union[str, PathLike]) –

    Local directory to cache downloaded metadata and sequences. Created if it doesn't exist.

  • remote_url (str) –

    Base URL of the remote refget store (e.g., "https://example.com/hg38" or "s3://bucket/hg38").

Returns:

  • RefgetStore –

    RefgetStore with metadata loaded, sequences fetched on-demand.

Raises:

  • IOError –

    If remote metadata cannot be fetched or cache cannot be written.

Example::

store = RefgetStore.open_remote(
    "/data/cache/hg38",
    "https://refget-server.com/hg38"
)
# First access fetches from remote and caches
seq = store.get_substring("chr1_digest", 0, 1000)
# Second access uses cache
seq2 = store.get_substring("chr1_digest", 1000, 2000)
set_encoding_mode
set_encoding_mode(mode: StorageMode) -> None

Change the storage mode, re-encoding/decoding existing sequences as needed.

When switching from Raw to Encoded, all Full sequences in memory are encoded (2-bit packed). When switching from Encoded to Raw, all Full sequences in memory are decoded back to raw bytes.

Parameters:

  • mode (StorageMode) –

    The storage mode to switch to (StorageMode.Raw or StorageMode.Encoded).

Example::

store = RefgetStore.in_memory()
store.set_encoding_mode(StorageMode.Raw)
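To give a feel for what 2-bit packing means, here is a minimal sketch that stores four A/C/G/T bases per byte and restores them. The code assignment (A=0, C=1, G=2, T=3), the bit order, and the function names are assumptions made for illustration; the actual on-disk layout used by the Rust implementation is not described here.

```python
from typing import Tuple

_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping for this sketch
_BASES = "ACGT"


def pack_2bit(seq: str) -> Tuple[bytes, int]:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # first base of each group goes in the high bits
        out[i // 4] |= _CODES[base] << shift
    # The length is returned too, since the last byte may be partly padding.
    return bytes(out), len(seq)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Reverse pack_2bit, reading back exactly `length` bases."""
    chars = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        chars.append(_BASES[(packed[i // 4] >> shift) & 0b11])
    return "".join(chars)
```

Packing this way needs a quarter of the bytes of the raw form, which is the space saving that an encoded mode trades against the cost of decoding.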
enable_persistence
enable_persistence(path: Union[str, PathLike]) -> None

Enable disk persistence for this store.

Sets up the store to write sequences to disk. Any in-memory Full sequences are flushed to disk and converted to Stubs.

Parameters:

  • path (Union[str, PathLike]) –

    Directory for storing sequences and metadata.

Raises:

  • IOError –

    If the directory cannot be created or written to.

Example::

store = RefgetStore.in_memory()
store.import_fasta("genome.fa")
store.enable_persistence("/data/store")  # Flush to disk
disable_persistence
disable_persistence() -> None

Disable disk persistence for this store.

New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if cache_path is set.

Example::

store = RefgetStore.open_remote("/cache", "https://example.com")
store.disable_persistence()  # Stop caching new sequences
import_fasta
import_fasta(file_path: Union[str, PathLike]) -> None

Import sequences from a FASTA file into the store.

Reads all sequences from a FASTA file and adds them to the store. Computes GA4GH digests and creates a sequence collection.

Parameters:

  • file_path (Union[str, PathLike]) –

    Path to the FASTA file.

Raises:

  • IOError –

    If the file cannot be read or parsed.

Example::

store = RefgetStore(StorageMode.Encoded)
store.import_fasta("genome.fa")
list_collections
list_collections() -> List[SequenceCollectionMetadata]

List all collection metadata in the store.

Returns metadata for all collections without loading full collection data. Use this for browsing/inventory operations.

Returns:

  • List[SequenceCollectionMetadata] –

    List of metadata for all collections in the store.

Example::

for meta in store.list_collections():
    print(f"Collection {meta.digest}: {meta.n_sequences} sequences")
get_collection_metadata
get_collection_metadata(collection_digest: str) -> Optional[SequenceCollectionMetadata]

Get metadata for a collection by digest.

Returns lightweight metadata without loading the full collection. Use this for quick lookups of collection information.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

Returns:

  • Optional[SequenceCollectionMetadata] –

    Collection metadata if found, None otherwise.

Example::

meta = store.get_collection_metadata("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
if meta:
    print(f"Collection has {meta.n_sequences} sequences")
get_collection
get_collection(collection_digest: str) -> SequenceCollection

Get a collection by digest with all sequences loaded.

Loads the collection and all its sequence data into memory. Use this when you need full access to sequence content.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

Returns:

  • SequenceCollection –

    The collection with all sequence data loaded.

Raises:

  • IOError –

    If the collection cannot be loaded.

Example::

collection = store.get_collection("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
for seq in collection.sequences:
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
iter_collections
iter_collections() -> List[SequenceCollection]

Iterate over all collections with their sequences loaded.

This loads all collection data upfront and returns a list of SequenceCollection objects with full sequence data.

For browsing without loading data, use list_collections() instead.

Returns:

  • List[SequenceCollection] –

    List of all collections with their sequence data loaded.

Example::

for coll in store.iter_collections():
    print(f"{coll.digest}: {len(coll.sequences)} sequences")
is_collection_loaded
is_collection_loaded(collection_digest: str) -> bool

Check if a collection is fully loaded.

Returns True if the collection's sequence list is loaded in memory, False if it's only metadata (stub).

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

Returns:

  • bool –

    True if loaded, False otherwise.

list_sequences
list_sequences() -> List[SequenceMetadata]

List all sequence metadata in the store.

Returns metadata for all sequences without loading sequence data. Use this for browsing/inventory operations.

Returns:

  • List[SequenceMetadata] –

    List of metadata for all sequences in the store.

Example::

for meta in store.list_sequences():
    print(f"{meta.name}: {meta.length} bp")
get_sequence_metadata
get_sequence_metadata(seq_digest: str) -> Optional[SequenceMetadata]

Get metadata for a sequence by digest (no data loaded).

Use this for lightweight lookups when you don't need the actual sequence.

Parameters:

  • seq_digest (str) –

    The sequence's SHA-512/24u digest.

Returns:

  • Optional[SequenceMetadata] –

    Sequence metadata if found, None otherwise.

get_sequence
get_sequence(digest: str) -> Optional[SequenceRecord]

Retrieve a sequence record by its digest (SHA-512/24u or MD5).

Loads the sequence data if not already in memory. Supports lookup by either SHA-512/24u (preferred) or MD5 digest.

Parameters:

  • digest (str) –

    Sequence digest (SHA-512/24u base64url or MD5 hex string).

Returns:

  • Optional[SequenceRecord] –

    The sequence record with data if found, None otherwise.

Example::

record = store.get_sequence("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
if record:
    print(f"Found: {record.metadata.name}")
    print(f"Sequence: {record.decode()[:50]}...")
get_sequence_by_name
get_sequence_by_name(collection_digest: str, sequence_name: str) -> Optional[SequenceRecord]

Retrieve a sequence by collection digest and sequence name.

Looks up a sequence within a specific collection using its name (e.g., "chr1", "chrM"). Loads the sequence data if needed.

Parameters:

  • collection_digest (str) –

    The collection's SHA-512/24u digest.

  • sequence_name (str) –

    Name of the sequence within that collection.

Returns:

  • Optional[SequenceRecord] –

    The sequence record with data if found, None otherwise.

Example::

record = store.get_sequence_by_name(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "chr1"
)
if record:
    print(f"Sequence: {record.decode()[:50]}...")
iter_sequences
iter_sequences() -> List[SequenceRecord]

Iterate over all sequences with their data loaded.

This ensures all sequence data is loaded and returns a list of SequenceRecord objects with full sequence data.

For browsing without loading data, use list_sequences() instead.

Returns:

  • List[SequenceRecord] –

    List of all sequence records with their data loaded.

Example::

for seq in store.iter_sequences():
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
get_substring
get_substring(seq_digest: str, start: int, end: int) -> Optional[str]

Extract a substring from a sequence.

Retrieves a specific region from a sequence using 0-based, half-open coordinates [start, end). Automatically loads sequence data if not already cached (for lazy-loaded stores).

Parameters:

  • seq_digest (str) –

    Sequence digest (SHA-512/24u).

  • start (int) –

    Start position (0-based, inclusive).

  • end (int) –

    End position (0-based, exclusive).

Returns:

  • Optional[str] –

    The substring sequence if found, None otherwise.

Example::

# Get first 1000 bases of chr1
seq = store.get_substring("chr1_digest", 0, 1000)
print(f"First 50bp: {seq[:50]}")
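The 0-based, half-open convention matches Python slicing, so adjacent windows share a boundary without overlapping. A minimal sketch on a plain string follows; the substring helper and its bounds check are illustrative, and how get_substring itself treats out-of-range coordinates is not stated here.

```python
def substring(sequence: str, start: int, end: int) -> str:
    """Return sequence[start:end] after validating 0-based, half-open bounds."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError(f"invalid range [{start}, {end}) for length {len(sequence)}")
    return sequence[start:end]


chrom = "ACGTACGTAC"
first = substring(chrom, 0, 4)   # positions 0, 1, 2, 3
second = substring(chrom, 4, 8)  # starts exactly where the first window ended
```

The length of a region is always end - start, and consecutive windows such as [0, 1000) and [1000, 2000) tile a sequence with no gaps.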
stats
stats() -> dict

Returns statistics about the store.

Returns:

  • dict –

    dict with keys:

    - 'n_sequences': Total number of sequences (Stub + Full)
    - 'n_sequences_loaded': Number of sequences with data loaded (Full)
    - 'n_collections': Total number of collections (Stub + Full)
    - 'n_collections_loaded': Number of collections with sequences loaded (Full)
    - 'storage_mode': Storage mode ('Raw' or 'Encoded')
    - 'total_disk_size': Total size of all files on disk in bytes

Note

n_collections_loaded only reflects collections fully loaded in memory. For remote stores, collections are loaded on-demand when accessed.

Example::

stats = store.stats()
print(f"Store has {stats['n_sequences']} sequences")
print(f"Collections: {stats['n_collections']} total, {stats['n_collections_loaded']} loaded")
write_store_to_directory
write_store_to_directory(root_path: Union[str, PathLike], seqdata_path_template: str) -> None

Write the store to a directory on disk.

Persists the store with all sequences and metadata to disk using the RefgetStore directory format.

Parameters:

  • root_path (Union[str, PathLike]) –

    Directory path to write the store to.

  • seqdata_path_template (str) –

    Path template for sequence files (e.g., "sequences/%s2/%s.seq" where %s2 = first 2 chars of digest, %s = full digest).

Example::

store.write_store_to_directory(
    "/data/my_store",
    "sequences/%s2/%s.seq"
)
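To make the template placeholders concrete, the sketch below expands a template for one digest following the description above (%s2 is the first two characters of the digest, %s is the full digest). The function name expand_seqdata_template is hypothetical, and the library's own expansion logic may differ in detail.

```python
def expand_seqdata_template(template: str, digest: str) -> str:
    """Expand "%s2" and "%s" placeholders for a single sequence digest."""
    # Replace the longer token first so "%s2" is not consumed by the "%s" rule.
    return template.replace("%s2", digest[:2]).replace("%s", digest)


path = expand_seqdata_template(
    "sequences/%s2/%s.seq",
    "aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2",
)
```

Sharding by a short digest prefix keeps any single directory from accumulating one file per sequence.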
get_seqs_bed_file
get_seqs_bed_file(collection_digest: str, bed_file_path: Union[str, PathLike], output_fasta_path: Union[str, PathLike]) -> None

Extract sequences for BED regions and write to FASTA.

Parameters:

  • collection_digest (str) –

    Collection digest to look up sequence names.

  • bed_file_path (Union[str, PathLike]) –

    Path to BED file with regions.

  • output_fasta_path (Union[str, PathLike]) –

    Path to write output FASTA file.
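Conceptually, BED-based extraction maps each region's chromosome name to a sequence and slices it with the region's 0-based, half-open coordinates. The sketch below shows that idea on in-memory strings; parse_bed and extract_regions are hypothetical helpers, and skipping regions on unknown sequences is a choice made for this sketch rather than documented library behavior.

```python
from typing import Dict, Iterable, List, Tuple


def parse_bed(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Read (chrom, start, end) from the first three BED columns."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and header lines
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def extract_regions(
    sequences: Dict[str, str], bed_lines: Iterable[str]
) -> List[Tuple[str, int, int, str]]:
    """Return (chrom, start, end, subsequence) for each known region."""
    results = []
    for chrom, start, end in parse_bed(bed_lines):
        seq = sequences.get(chrom)
        if seq is None:
            continue  # unknown sequence name: skipped in this sketch
        results.append((chrom, start, end, seq[start:end]))
    return results
```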

get_seqs_bed_file_to_vec
get_seqs_bed_file_to_vec(collection_digest: str, bed_file_path: Union[str, PathLike]) -> List[RetrievedSequence]

Extract sequences for BED regions and return as list.

Parameters:

  • collection_digest (str) –

    Collection digest to look up sequence names.

  • bed_file_path (Union[str, PathLike]) –

    Path to BED file with regions.

Returns:

  • List[RetrievedSequence] –

    List of retrieved sequence segments for the BED regions.

export_fasta
export_fasta(collection_digest: str, output_path: Union[str, PathLike], sequence_names: Optional[List[str]] = None, line_width: Optional[int] = None) -> None

Export sequences from a collection to a FASTA file.

Parameters:

  • collection_digest (str) –

    Collection to export from.

  • output_path (Union[str, PathLike]) –

    Path to write FASTA file.

  • sequence_names (Optional[List[str]], default: None ) –

    Optional list of sequence names to export. If None, exports all sequences in the collection.

  • line_width (Optional[int], default: None ) –

    Optional line width for wrapping sequences. If None, uses default of 80.
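The effect of line_width can be pictured as splitting each sequence into fixed-size chunks under its header line. This is a minimal sketch with a hypothetical format_fasta_record helper; it applies the documented default of 80 when line_width is None, while the exact output of export_fasta (for example header contents) is not specified here.

```python
from typing import Optional


def format_fasta_record(name: str, sequence: str, line_width: Optional[int] = None) -> str:
    """Format one FASTA record, wrapping the sequence at line_width characters."""
    width = 80 if line_width is None else line_width
    if width <= 0:
        raise ValueError("line_width must be positive")
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}", *lines]) + "\n"


wrapped = format_fasta_record("chr1", "ACGTACGTAC", line_width=4)
```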

export_fasta_by_digests
export_fasta_by_digests(digests: List[str], output_path: Union[str, PathLike], line_width: Optional[int] = None) -> None

Export sequences by their digests to a FASTA file.

Parameters:

  • digests (List[str]) –

    List of sequence digests to export.

  • output_path (Union[str, PathLike]) –

    Path to write FASTA file.

  • line_width (Optional[int], default: None ) –

    Optional line width for wrapping sequences. If None, uses default of 80.

__str__
__str__() -> str
__repr__
__repr__() -> str

Functions

sha512t24u_digest

sha512t24u_digest(readable: Union[str, bytes]) -> str

Compute the GA4GH SHA-512/24u digest for a sequence.

This function computes the GA4GH refget standard digest (truncated SHA-512, base64url encoded) for a given sequence string or bytes.

Parameters:

  • readable (Union[str, bytes]) –

    Input sequence as str or bytes.

Returns:

  • str –

    The SHA-512/24u digest (32 character base64url string).

Raises:

  • TypeError –

    If input is not str or bytes.

Example::

from gtars.refget import sha512t24u_digest
digest = sha512t24u_digest("ACGT")
print(digest)
# Output: 'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
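For readers who want to see what the digest consists of, the description above (truncated SHA-512, base64url encoded) can be sketched with the standard library alone. The function name sha512t24u_sketch is hypothetical, and this is not the Rust implementation; in particular, encoding str input as UTF-8 is an assumption.

```python
import base64
import hashlib
from typing import Union


def sha512t24u_sketch(readable: Union[str, bytes]) -> str:
    """SHA-512, truncated to 24 bytes, then base64url encoded (32 characters)."""
    if isinstance(readable, str):
        data = readable.encode("utf-8")
    elif isinstance(readable, bytes):
        data = readable
    else:
        raise TypeError("input must be str or bytes")
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


digest = sha512t24u_sketch("ACGT")
```

Because 24 bytes encode to exactly 32 base64 characters, the result never carries "=" padding.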

md5_digest

md5_digest(readable: Union[str, bytes]) -> str

Compute the MD5 digest for a sequence.

This function computes the MD5 hash for a given sequence string or bytes. MD5 is supported for backward compatibility with legacy systems.

Parameters:

  • readable (Union[str, bytes]) –

    Input sequence as str or bytes.

Returns:

  • str –

    The MD5 digest (32 character hexadecimal string).

Raises:

  • TypeError –

    If input is not str or bytes.

Example::

from gtars.refget import md5_digest
digest = md5_digest("ACGT")
print(digest)
# Output: 'f1f8f4bf413b16ad135722aa4591043e'
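An equivalent check can be written with hashlib when gtars is not available, for instance to cross-validate digests from a legacy system. The helper name md5_sketch is hypothetical, and treating str input as UTF-8 is an assumption.

```python
import hashlib
from typing import Union


def md5_sketch(readable: Union[str, bytes]) -> str:
    """Return the MD5 hex digest of a sequence given as str or bytes."""
    if isinstance(readable, str):
        data = readable.encode("utf-8")
    elif isinstance(readable, bytes):
        data = readable
    else:
        raise TypeError("input must be str or bytes")
    return hashlib.md5(data).hexdigest()


md5_hex = md5_sketch("ACGT")
```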

digest_fasta

digest_fasta(fasta: Union[str, PathLike]) -> SequenceCollection

Digest all sequences in a FASTA file and compute collection-level digests.

This function reads a FASTA file and computes GA4GH-compliant digests for each sequence, as well as collection-level digests (Level 1 and Level 2) following the GA4GH refget specification.

Parameters:

  • fasta (Union[str, PathLike]) –

    Path to FASTA file (str or PathLike).

Returns:

  • SequenceCollection –

    Collection containing all sequences with their metadata and computed digests.

Raises:

  • IOError –

    If the FASTA file cannot be read or parsed.

Example::

from gtars.refget import digest_fasta
collection = digest_fasta("genome.fa")
print(f"Collection digest: {collection.digest}")
print(f"Number of sequences: {len(collection)}")

compute_fai

compute_fai(fasta: Union[str, PathLike]) -> List[FaiRecord]

Compute FASTA index (FAI) metadata for all sequences in a FASTA file.

This function computes the FAI index metadata (offset, line_bases, line_bytes) for each sequence in a FASTA file, compatible with samtools faidx format. Only works with uncompressed FASTA files.

Parameters:

  • fasta (Union[str, PathLike]) –

    Path to FASTA file (str or PathLike). Must be uncompressed.

Returns:

  • List[FaiRecord] –

    List of FAI records, one per sequence, containing name, length, and FAI metadata (offset, line_bases, line_bytes).

Raises:

  • IOError –

    If the FASTA file cannot be read or is compressed.

Example::

from gtars.refget import compute_fai
fai_records = compute_fai("genome.fa")
for record in fai_records:
    print(f"{record.name}: {record.length} bp")
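To show what the FAI fields mean, the sketch below derives them for a small uncompressed FASTA file: offset is the byte position of a sequence's first base, line_bases is the number of bases on its first line, and line_bytes is that line's length including the newline. FaiSketch and compute_fai_sketch are hypothetical names, and the sketch assumes well-formed input with uniform line lengths, as samtools faidx requires.

```python
import os
import tempfile
from typing import List, NamedTuple, Optional


class FaiSketch(NamedTuple):
    """Hypothetical record type for this sketch (not the library's FaiRecord)."""

    name: str
    length: int
    offset: int
    line_bases: int
    line_bytes: int


def compute_fai_sketch(path: str) -> List[FaiSketch]:
    records: List[FaiSketch] = []
    current: Optional[dict] = None
    position = 0  # byte offset just past the line most recently read
    with open(path, "rb") as handle:
        for raw in handle:
            position += len(raw)
            stripped = raw.rstrip(b"\r\n")
            if raw.startswith(b">"):
                if current is not None:
                    records.append(FaiSketch(**current))
                parts = stripped[1:].split()
                name = parts[0].decode("ascii") if parts else ""
                # The first base starts right after the header line.
                current = {"name": name, "length": 0, "offset": position,
                           "line_bases": 0, "line_bytes": 0}
            elif current is not None and stripped:
                if current["line_bases"] == 0:
                    current["line_bases"] = len(stripped)
                    current["line_bytes"] = len(raw)
                current["length"] += len(stripped)
    if current is not None:
        records.append(FaiSketch(**current))
    return records


# Small demonstration on a temporary two-sequence FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "demo.fa")
    with open(fasta_path, "wb") as out:
        out.write(b">chr1 test\nACGTACGT\nACGT\n>chr2\nGGCC\n")
    records = compute_fai_sketch(fasta_path)
```

Byte offsets only make sense on uncompressed data, which is consistent with the restriction stated above.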

load_fasta

load_fasta(fasta: Union[str, PathLike]) -> SequenceCollection

Load a FASTA file with sequence data into a SequenceCollection.

This function reads a FASTA file and loads all sequences with their data into memory. Unlike digest_fasta(), this includes the actual sequence data, not just metadata.

Parameters:

  • fasta (Union[str, PathLike]) –

    Path to FASTA file (str or PathLike).

Returns:

  • SequenceCollection –

    Collection containing all sequences with their metadata and sequence data loaded.

Raises:

  • IOError –

    If the FASTA file cannot be read or parsed.

Example::

from gtars.refget import load_fasta
collection = load_fasta("genome.fa")
first_seq = collection[0]
print(f"Sequence: {first_seq.decode()[:50]}...")

digest_sequence

digest_sequence(name: str, data: bytes, description: Optional[str] = None) -> SequenceRecord

Create a SequenceRecord from raw data, computing all metadata.

This is the sequence-level parallel to digest_fasta() for collections. It computes the GA4GH sha512t24u digest, MD5 digest, detects the alphabet, and returns a SequenceRecord with computed metadata and the original data.

The input data is automatically uppercased to ensure consistent digest computation (matching FASTA processing behavior).

Parameters:

  • name (str) –

    The sequence name (e.g., "chr1").

  • data (bytes) –

    The raw sequence bytes (e.g., b"ACGTACGT").

  • description (Optional[str], default: None ) –

    Optional description text for the sequence.

Returns:

  • SequenceRecord –

    A SequenceRecord with computed metadata and the original data (uppercased).

Example::

from gtars.refget import digest_sequence
seq = digest_sequence("chr1", b"ACGTACGT")
print(seq.metadata.name, seq.metadata.length)
# Output: chr1 8

# With description
seq2 = digest_sequence("chr1", b"ACGT", description="Chromosome 1")
print(seq2.metadata.description)
# Output: Chromosome 1