BED Classification
BED classification is a calculated metadata about RegionSet file that classifies RegionSet files based on the number of columns and the types of data contained within those columns.
BED classifier calculates 2 metadata fields: BED compliance and Data Format. (BED classifier tutorial can be found here: BED Classifier tutorial)
đĸ BED compliance (compliant_columns
+ non_compliant_columns
)
BED compliance is a string that indicates the number of compliant and non-compliant columns in the BED file.
It is represented as bedn+m
, where n
is the number of compliant columns, and m
is the number of non-compliant columns.
For example, bed3+0
indicates that there are 3 compliant columns and 0 non-compliant columns.
Example
bed3+0, bed7+3
bed3+0
indicates that there are 3 compliant columns and 0 non-compliant columns.
- bed7+3
indicates that there are 7 compliant columns and 3 non-compliant columns.
đĸ Data formats
The data format is a string that indicates flavor of the RegionSet format. We defined the following data formats:
class DATA_FORMAT(str, Enum):
UNKNOWN = "unknown_data_format"
UCSC_BED = "ucsc_bed"
UCSC_BED_RS = "ucsc_bed_rs"
BED_LIKE = "bed_like"
BED_LIKE_RS = "bed_like_rs"
ENCODE_NARROWPEAK = "encode_narrowpeak"
ENCODE_NARROWPEAK_RS = "encode_narrowpeak_rs"
ENCODE_BROADPEAK = "encode_broadpeak"
ENCODE_BROADPEAK_RS = "encode_broadpeak_rs"
ENCODE_GAPPEDPEAK = "encode_gappedpeak"
ENCODE_GAPPEDPEAK_RS = "encode_gappedpeak_rs"
ENCODE_RNA_ELEMENTS = "encode_rna_elements"
ENCODE_RNA_ELEMENTS_RS = "encode_rna_elements_rs"
Note
rs
refers to relaxed_score
which indicates that a fifth column was present where the values are integers greater than 0.
In contrast, a strict interpretation for column 5 is: A score between 0 and 1000.
-
UNKNOWN
Classification was unable to determine the data format.
-
UCSC_BED
Conforms to ucsc bed
-
UCSC_BED_RS
Conforms to ucsc bed but with a relaxed interpretation for the fifth column.
-
BED_LIKE
Data is tab delimited but contains columns that are not compliant with ucsc bed. Example:
bedn+m
wheren
are compliant columns,m
are non-compliant columns andm > 0
-
BED_LIKE_RS
Data is tab delimited but contains columns that are not compliant with ucsc bed but with a relaxed interpretation for the fifth column. Example:
bedn+m
wheren
are compliant columns,m
are non-compliant columns andm > 0
, Column 5 =integer > 0
-
ENCODE_NARROWPEAK
Conforms to ENCODE narrowPeak
-
ENCODE_NARROWPEAK_RS
Conforms to ENCODE narrowPeak but with a relaxed interpretation for the fifth column.
-
ENCODE_BROADPEAK
Conforms to ENCODE broadPeak
-
ENCODE_BROADPEAK_RS
Conforms to ENCODE broadPeak but with a relaxed interpretation for the fifth column.
-
ENCODE_GAPPEDPEAK
Conforms to ENCODE gappedPeak
-
ENCODE_GAPPEDPEAK_RS
Conforms to ENCODE gappedPeak but with a relaxed interpretation for the fifth column.
-
ENCODE_RNA_ELEMENTS
Conforms to ENCODE RNA elements
-
ENCODE_RNA_ELEMENTS_RS
Conforms to ENCODE RNA elements but with a relaxed interpretation for the fifth column.
âšī¸ References:
-
Information on various data formats can be found here: https://genome.ucsc.edu/FAQ/FAQformat.html
-
Additional detailed specifications for the Browser Extensible Data (BED) format can be found here: https://samtools.github.io/hts-specs/BEDv1.pdf
-
BEDBoss tutorial of bedclassifier can be found here: BED Classifier tutorial