How to assess universe fit to a collection of BED files¶
Introduction¶
In this tutorial, you will see how to assess the fit of a given universe to a collection of files. (Tutorials on creating different universes from files can be found here and here.) Choosing which universe best represents the data can be challenging. To help with this decision, we created three metrics for assessing universe fit to region collections: a base-level overlap score, a region boundary score, and a likelihood score. Universe fit can be assessed using either the CLI or Python functions, depending on the use case. With the CLI, you can create a file with the values of the universe assessment methods for each file in the collection, while with the Python functions you can get measures of universe fit to the whole collection.
CLI¶
Using the CLI, you can calculate both the base-level overlap score and the region boundary score separately for each file in the collection; the per-file values can then be summarized. To calculate them, you need the raw files as well as the universe being assessed. You must also choose at least one assessment metric to calculate:
--overlap - to calculate the base pair overlap between the universe and the regions in the file, the number of base pairs only in the universe, and the number of base pairs only in the file, which can be used to calculate the F10 score;
--distance - to calculate the median distance from regions in the raw file to the universe;
--distance-universe-to-file - to calculate the median distance from the universe to regions in the raw file;
--distance-flexible - to calculate the median distance from regions in the raw file to the universe, taking universe flexibility into account;
--distance-flexible-universe-to-file - to calculate the median distance from the universe to regions in the raw file, taking universe flexibility into account.
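The difference between the file-to-universe and universe-to-file directions can be illustrated with a simplified sketch. It assumes that distance is measured between region start coordinates only, which may not match how geniml defines it; the helper names and the coordinates are hypothetical and serve only to show why the two directions can give very different medians.

```python
from bisect import bisect_left
from statistics import median


def nearest_distance(position, sorted_points):
    """Distance from position to the closest value in sorted_points."""
    i = bisect_left(sorted_points, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_points[max(i - 1, 0): i + 1]
    return min(abs(position - p) for p in candidates)


def median_start_distance(query_starts, target_starts):
    """Median distance from each query start to its nearest target start."""
    targets = sorted(target_starts)
    return median(nearest_distance(s, targets) for s in query_starts)


# Hypothetical start coordinates on a single chromosome.
file_starts = [100, 480, 900]
universe_starts = [90, 500, 2000, 5000]

file_to_universe = median_start_distance(file_starts, universe_starts)
universe_to_file = median_start_distance(universe_starts, file_starts)

print(file_to_universe)  # 20
print(universe_to_file)  # 560.0
```

In this toy case most file regions have a universe region nearby, so the file-to-universe median is small, while the universe contains regions far from anything in the file, which inflates the universe-to-file median.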
Here is an example that calculates all available metrics for an HMM universe:
geniml assess-universe --raw-data-folder raw/ \
--file-list file_list.txt \
--universe universe_hmm.bed \
--folder-out . \
--pref test_assess \
--overlap \
--distance \
--distance-universe-to-file \
--distance-flexible \
--distance-flexible-universe-to-file
The resulting file is called test_assess_data.csv, and contains columns with the raw calculated metrics for each file: file, univers/file, file/universe, universe&file, median_dist_file_to_universe, median_dist_file_to_universe_flex, median_dist_universe_to_file, median_dist_universe_to_file_flex.
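As a rough illustration of how the three overlap columns relate to an F10 score, the sketch below applies the standard F-beta formula with beta = 10 to the four example rows of this tutorial's test_assess_data.csv. It assumes that the universe plays the role of the prediction and the file the role of the ground truth, and that per-file scores are averaged over the collection; geniml's own aggregation may differ, and `f_beta` is a hypothetical helper, not part of the library.

```python
def f_beta(overlap, universe_only, file_only, beta=10.0):
    """Standard F-beta score from base pair counts (assumed definition)."""
    precision = overlap / (overlap + universe_only)  # share of universe bp covered by the file
    recall = overlap / (overlap + file_only)         # share of file bp covered by the universe
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Overlap columns of the four example rows in test_assess_data.csv.
rows = [
    {"file": "test_1.bed", "univers/file": 2506, "file/universe": 403, "universe&file": 3630},
    {"file": "test_2.bed", "univers/file": 1803, "file/universe": 146, "universe&file": 4333},
    {"file": "test_3.bed", "univers/file": 2949, "file/universe": 0, "universe&file": 3187},
    {"file": "test_4.bed", "univers/file": 2071, "file/universe": 546, "universe&file": 4065},
]

scores = {
    r["file"]: f_beta(r["universe&file"], r["univers/file"], r["file/universe"])
    for r in rows
}
mean_f10 = sum(scores.values()) / len(scores)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print(f"mean F10: {mean_f10:.2f}")
```

With beta = 10, recall is weighted much more heavily than precision, so a universe that covers nearly all of a file's base pairs scores well even if it also contains many base pairs that the file does not.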
Python functions¶
The file created with the CLI can be further summarized into metrics that assess the fit of a universe to the whole collection: a base-level overlap score (F10) and a region boundary score (RBS).
from geniml.assess.assess import get_rbs_from_assessment_file, get_f_10_score_from_assessment_file
import pandas as pd
assessment_file_path = "test_assess_data.csv"
df = pd.read_csv(assessment_file_path)
df.head()
| | file | univers/file | file/universe | universe&file | median_dist_file_to_universe | median_dist_file_to_universe_flex | median_dist_universe_to_file | median_dist_universe_to_file_flex |
|---|---|---|---|---|---|---|---|---|
| 0 | test_1.bed | 2506 | 403 | 3630 | 27.0 | 0.0 | 76.5 | 0.0 |
| 1 | test_2.bed | 1803 | 146 | 4333 | 27.0 | 0.0 | 70.0 | 7.5 |
| 2 | test_3.bed | 2949 | 0 | 3187 | 28.0 | 0.0 | 225.0 | 224.5 |
| 3 | test_4.bed | 2071 | 546 | 4065 | 27.0 | 0.0 | 116.5 | 105.5 |
rbs = get_rbs_from_assessment_file(assessment_file_path)
f_10 = get_f_10_score_from_assessment_file(assessment_file_path)
rbs_flex = get_rbs_from_assessment_file(assessment_file_path, flexible=True)
print(f"Universe \nF10: {f_10:.2f}\nRBS: {rbs:.2f}\nflexible RBS: {rbs_flex:.2f}")
Universe
F10: 0.93
RBS: 0.77
flexible RBS: 0.98
Alternatively, all of these metrics, as well as a likelihood score (LH), can be calculated directly from the universe and the raw files:
from geniml.assess.assess import get_f_10_score
f10 = get_f_10_score(
"raw/",
'file_list.txt',
"universe_hmm.bed",
1)
f"Universe F10: {f10:.2f}"
'Universe F10: 0.93'
from geniml.assess.assess import get_mean_rbs
rbs = get_mean_rbs("raw/",
'file_list.txt',
"universe_hmm.bed", 1)
f"Universe RBS: {rbs:.2f}"
'Universe RBS: 0.77'
from geniml.assess.assess import get_likelihood
lh = get_likelihood(
"model.tar",
"universe_hmm.bed",
"coverage/"
)
f"Universe LH: {lh:.2f}"
'Universe LH: -127156.87'
Both the region boundary score and the likelihood can also be calculated taking universe flexibility into account:
from geniml.assess.assess import get_mean_rbs
rbs_flex = get_mean_rbs(
"raw/",
'file_list.txt',
"universe_hmm.bed",
1,
flexible=True)
f"Universe flexible RBS: {rbs_flex:.2f}"
'Universe flexible RBS: 0.98'
lh_flex = get_likelihood(
"model.tar",
"universe_hmm.bed",
"coverage/"
)
f"Universe flexible LH: {lh_flex:.2f}"
'Universe flexible LH: -127156.87'
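When several candidate universes are available, the same metrics can be collected side by side to support the choice between them. The sketch below is one possible way to organize this: `compare_universes` and `print_comparison` are hypothetical helpers, not part of geniml, and the scoring functions are passed in as callables so that the example runs without the library. The placeholder scorers return made-up numbers, and `universe_other.bed` is a hypothetical file name.

```python
def compare_universes(universe_paths, scorers):
    """Return {universe_path: {metric_name: value}} for every universe."""
    return {
        path: {name: score(path) for name, score in scorers.items()}
        for path in universe_paths
    }


def print_comparison(results):
    """Print one line of metric values per universe."""
    for path, metrics in results.items():
        summary = ", ".join(f"{name}={value:.2f}" for name, value in metrics.items())
        print(f"{path}: {summary}")


# With geniml installed, the scorers could wrap the calls shown in this tutorial:
#   from geniml.assess.assess import get_f_10_score, get_mean_rbs
#   scorers = {
#       "F10": lambda u: get_f_10_score("raw/", "file_list.txt", u, 1),
#       "RBS": lambda u: get_mean_rbs("raw/", "file_list.txt", u, 1),
#   }

# Placeholder scorers with made-up values keep this sketch runnable on its own.
made_up_scores = {
    "universe_hmm.bed": {"F10": 0.9, "RBS": 0.7},
    "universe_other.bed": {"F10": 0.8, "RBS": 0.6},  # hypothetical second universe
}
scorers = {
    "F10": lambda u: made_up_scores[u]["F10"],
    "RBS": lambda u: made_up_scores[u]["RBS"],
}

results = compare_universes(["universe_hmm.bed", "universe_other.bed"], scorers)
print_comparison(results)
```

Passing the scorers as callables keeps the comparison logic independent of any particular metric, so a likelihood scorer could be added to the dictionary in the same way.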