BED classifier tutorial
BED classifier is a utility that allows you to classify BED files based on the number of columns and the types of data contained within those columns.
Information on various data formats can be found here: https://genome.ucsc.edu/FAQ/FAQformat.html
Additional detailed specifications for the Browser Extensible Data (BED) format can be found here: https://samtools.github.io/hts-specs/BEDv1.pdf
get_bed_classification
The function, get_bed_classification
, takes a path to bed-like file or a dataframe and returns a BedClassification
object with the following attributes:
class BedClassification(BaseModel):
bed_compliance: str
data_format: DATA_FORMAT
compliant_columns: int
non_compliant_columns: int
where DATA_FORMAT
is defined as:
class DATA_FORMAT(str, Enum):
UNKNOWN = "unknown_data_format"
UCSC_BED = "ucsc_bed"
BED_RS = "bed_rs"
BED_LIKE = "bed_like"
BED_LIKE_RS = "bed_like_rs"
ENCODE_NARROWPEAK = "encode_narrowpeak"
ENCODE_NARROWPEAK_RS = "encode_narrowpeak_rs"
ENCODE_BROADPEAK = "encode_broadpeak"
ENCODE_BROADPEAK_RS = "encode_broadpeak_rs"
ENCODE_GAPPEDPEAK = "encode_gappedpeak"
ENCODE_GAPPEDPEAK_RS = "encode_gappedpeak_rs"
ENCODE_RNA_ELEMENTS = "encode_rna_elements"
ENCODE_RNA_ELEMENTS_RS = "encode_rna_elements_rs"
Example usage of the BED classifier:
from bedboss.bedclassifier.bedclassifier import get_bed_classification
classification = get_bed_classification("path/to/bedfile.bed")
print(f"{classification.bed_compliance}, {classification.data_format}, {classification.compliant_columns}, {classification.non_compliant_columns}")
## Example 1
## > 'bed3+0', 'ucsc_bed', 3, 0
## Example 2
## > 'bed6+4', 'encode_narrowpeak', 6, 4
Data formats
Below rs
refers to relaxed_score
which indicates that a fifth column was present where the values are integers greater than 0. In constrast, a strict interpretation for column 5 is:
Column 5 - score - A score between 0 and 1000.
UNKNOWN
Classification was unable to determine the data format.
UCSC_BED
Conforms to ucsc bed
BED_RS
Conforms to ucsc bed but with a relaxed interpretation for the fifth column.
BED_LIKE
Data is tab delimited but contains columns that are not compliant with ucsc bed.
Example: bedn+m
where n
are compliant columns, m
are non-compliant columns and m > 0
BED_LIKE_RS
Data is tab delimited but contains columns that are not compliant with ucsc bed but with a relaxed interpretation for the fifth column.
Example:
bedn+m
where n
are compliant columns, m
are non-compliant columns and m > 0
, Column 5 = integer > 0
ENCODE_NARROWPEAK
Conforms to ENCODE narrowPeak
ENCODE_NARROWPEAK_RS
Conforms to ENCODE narrowPeak but with a relaxed interpretation for the fifth column.
ENCODE_BROADPEAK
Conforms to ENCODE broadPeak
ENCODE_BROADPEAK_RS
Conforms to ENCODE broadPeak but with a relaxed interpretation for the fifth column.
ENCODE_GAPPEDPEAK
Conforms to ENCODE gappedPeak
ENCODE_GAPPEDPEAK_RS
Conforms to ENCODE gappedPeak but with a relaxed interpretation for the fifth column.
ENCODE_RNA_ELEMENTS
Conforms to ENCODE RNA elements
ENCODE_RNA_ELEMENTS_RS
Conforms to ENCODE RNA elements but with a relaxed interpretation for the fifth column.