Assess Module

Modules

assess

Functions

run_all_assessment_methods

run_all_assessment_methods(raw_data_folder, file_list, universe, no_workers, folder_out, pref, save_each, overlap=False, distance_f_t_u=False, distance_f_t_u_flex=False, distance_u_t_f=False, distance_u_t_f_flex=False)

Assess universe fit to collection using overlap and distance metrics

Parameters:

Name	Type	Description	Default
`raw_data_folder`	`str`	path to raw files from the collection	required
`file_list`	`str`	path to file with list of files in the collection	required
`universe`	`str`	path to universe that is being assessed	required
`no_workers`	`int`	number of workers for multiprocessing	required
`folder_out`	`str`	output folder	required
`pref`	`str`	prefix used for creating output files	required
`save_each`	`bool`	whether to save the output of distance metrics for each region	required
`overlap`	`bool`	whether to calculate overlap metrics	`False`
`distance_f_t_u`	`bool`	whether to calculate distance from file to universe metrics	`False`
`distance_f_t_u_flex`	`bool`	whether to calculate flexible distance from file to universe metrics	`False`
`distance_u_t_f`	`bool`	whether to calculate distance from universe to file metrics	`False`
`distance_u_t_f_flex`	`bool`	whether to calculate flexible distance from universe to file metrics	`False`
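
As a rough illustration of how the parameters above fit together, the sketch below assembles the arguments and passes them to a callable. This page does not state an import path, so the function is passed in as `run_fn`; the helper name `assess_collection`, the worker count, the prefix, and every path are hypothetical.

```python
from typing import Any, Callable


def assess_collection(run_fn: Callable[..., Any], collection_dir: str,
                      file_list: str, universe: str, out_dir: str) -> Any:
    """Call an assessment function with overlap and file-to-universe distance enabled.

    `run_fn` is expected to accept the signature documented above
    (for example, run_all_assessment_methods itself).
    """
    return run_fn(
        collection_dir,  # raw_data_folder
        file_list,       # file_list
        universe,        # universe
        4,               # no_workers (arbitrary choice for this sketch)
        out_dir,         # folder_out
        "example",       # pref (hypothetical prefix)
        False,           # save_each
        overlap=True,
        distance_f_t_u=True,
    )
```

With the real function in scope, the call would look like `assess_collection(run_all_assessment_methods, "raw/", "file_list.txt", "universe.bed", "out/")`. The metric flags left at their `False` default are not requested.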

get_rbs

get_rbs(f_t_u, u_t_f)

Calculate RBS

get_mean_rbs

get_mean_rbs(folder, file_list, universe, no_workers, flexible=False)

Calculate average RBS of the collection

Parameters:

Name	Type	Description	Default
`folder`	`str`	path to folder with the collection	required
`file_list`	`str`	path to file with list of files in the collection	required
`universe`	`str`	path to the universe	required
`no_workers`	`int`	number of workers for multiprocessing	required
`flexible`	`bool`	whether to calculate the flexible version of the metric	`False`

Returns:

Type	Description
	average RBS

get_rbs_from_assessment_file

get_rbs_from_assessment_file(file, cs_each_file=False, flexible=False)

Calculate RBS from a file with per-file metric results

Parameters:

Name	Type	Description	Default
`file`	`str`	path to file with assessment results	required
`cs_each_file`	`bool`	whether to report RBS for each file instead of the average for the collection	`False`
`flexible`	`bool`	whether to use the flexible version of the metric	`False`

get_f_10_score

get_f_10_score(folder, file_list, universe, no_workers)

Get F10 score for a universe and a collection of files

Parameters:

Name	Type	Description	Default
`folder`	`str`	path to folder with the collection	required
`file_list`	`str`	path to file with list of files in the collection	required
`universe`	`str`	path to the universe	required
`no_workers`	`int`	number of workers for multiprocessing	required

Returns:

Type	Description
	average F10 score
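
This page does not define the F10 score. Assuming it is the standard F-beta score with beta = 10, computed from a precision-like and a recall-like overlap fraction (which fraction plays which role is also an assumption), a minimal sketch looks like this. The name `f_beta` is hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 10.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # both inputs are zero, so this sketch defines the score as zero
    return (1 + beta ** 2) * precision * recall / denom
```

With beta = 10, a high recall dominates the score: `f_beta(0.5, 1.0)` is about 0.99 even though precision is only 0.5.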

get_f_10_score_from_assessment_file

get_f_10_score_from_assessment_file(file, f10_each_file=False)

Get F10 score from assessment output file

Parameters:

Name	Type	Description	Default
`file`	`str`	path to file with assessment results	required
`f10_each_file`	`bool`	whether to report F10 for each file instead of the average for the collection	`False`

get_likelihood

get_likelihood(model_file, universe, cove_folder, cove_prefix='all', flexible=False, save_peak_input=False)

Calculate universe likelihood given collection

Parameters:

Name	Type	Description	Default
`model_file`	`str`	path to file with likelihood model	required
`universe`	`str`	path to the universe	required
`cove_folder`	`str`	path to the coverage folder	required
`cove_prefix`	`str`	prefix used for generating coverage	`'all'`
`flexible`	`bool`	whether to calculate flexible likelihood	`False`
`save_peak_input`	`bool`	whether to save the likelihood input of each region	`False`

Returns:

Type	Description

filter_universe

filter_universe(universe, universe_filtered, min_size=0, min_coverage=0, filter_lh=False, model_file=None, cove_folder=None, cove_prefix=None, lh_cutoff=0)

Filter universe by region size, coverage by collection, likelihood

Parameters:

Name	Type	Description	Default
`universe`	`str`	path to input universe	required
`universe_filtered`	`str`	path to output filtered universe	required
`min_size`	`int`	minimum size of the region in the output universe	`0`
`min_coverage`	`int`	minimum coverage of a universe region by the collection	`0`
`filter_lh`	`bool`	whether to use likelihood to filter the universe	`False`
`model_file`	`str`	path to collection likelihood model	`None`
`cove_folder`	`str`	path to folder with coverage tracks	`None`
`cove_prefix`	`str`	prefix used for creating tracks	`None`
`lh_cutoff`	`int`	minimum likelihood input	`0`
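
The size and coverage filters can be pictured as below. This is only a sketch under assumptions: regions are `(chrom, start, end)` tuples, a per-region coverage value is supplied by the caller, and both thresholds are treated as inclusive. The real function works on universe and coverage files and may differ. `filter_regions` is a hypothetical name.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]


def filter_regions(regions: Iterable[Region], coverage: Iterable[int],
                   min_size: int = 0, min_coverage: int = 0) -> List[Region]:
    """Keep regions that are long enough and covered well enough."""
    kept = []
    for region, cov in zip(regions, coverage):
        _, start, end = region
        # Assumption: thresholds are inclusive lower bounds.
        if end - start >= min_size and cov >= min_coverage:
            kept.append(region)
    return kept
```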

cli

Functions

build_subparser

build_subparser(parser)

Builds argument parser.

Returns:

Type	Description
	Argument parser

distance

Functions

flexible_distance_between_two_regions

flexible_distance_between_two_regions(region, query)

Calculate distance between region and flexible region from flexible universe

Parameters:

Name	Type	Description	Default
`region`		region from flexible universe	required
`query`	`int`	analyzed region	required

Returns:

Type	Description
	distance

distance_between_two_regions

distance_between_two_regions(region, query)

Calculate distance between region in database and region from the query

Parameters:

Name	Type	Description	Default
`region`	`[int]`	region from hard universe	required
`query`	`int`	analysed region	required

Returns:

Type	Description
	distance
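
One plausible reading of the two distance functions above, not taken from the implementation: for a hard universe the distance is the absolute difference between one universe boundary and the query position, while for a flexible universe the boundary is an interval and any query position inside it has distance zero. The names `hard_distance` and `flexible_distance` are hypothetical.

```python
from typing import Sequence


def hard_distance(boundary: int, query: int) -> int:
    """Distance between a single universe boundary and a query position."""
    return abs(boundary - query)


def flexible_distance(interval: Sequence[int], query: int) -> int:
    """Distance between a flexible boundary (low, high) and a query position."""
    low, high = interval
    if low <= query <= high:
        return 0  # the query falls inside the flexible boundary
    return min(abs(low - query), abs(high - query))
```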

distance_to_closest_region

distance_to_closest_region(db, db_queue, i, current_chrom, unused_db, pos_index, flexible, uni_to_file)

Calculate distance from given peak to the closest region in database

Parameters:

Name	Type	Description	Default
`db`	`file`	database file	required
`db_queue`	`list`	queue of three last positions in database	required
`i`		analyzed position from the query	required
`current_chrom`	`str`	current analyzed chromosome from query	required
`unused_db`	`list`	list of positions from universe that were not compared to query	required
`pos_index`	`list`	which indexes from universe region use to calculate distance	required
`flexible`	`bool`	whether the universe is flexible	required
`uni_to_file`	`bool`	whether to calculate distance from universe to file	required

Returns:

Type	Description
	peak distance to universe
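
The function above works through the universe file with a short queue of recent positions. For small in-memory data, the same quantity (the distance from a query position to the closest universe position) can be sketched with binary search instead. This swaps in `bisect` for the queue-based approach and is not the library's algorithm; `nearest_distance` is a hypothetical name.

```python
from bisect import bisect_left
from typing import Sequence


def nearest_distance(sorted_positions: Sequence[int], query: int) -> int:
    """Distance from `query` to the closest value in a sorted sequence."""
    if not sorted_positions:
        raise ValueError("no universe positions to compare against")
    i = bisect_left(sorted_positions, query)
    candidates = []
    if i < len(sorted_positions):
        candidates.append(sorted_positions[i] - query)      # neighbor at or after query
    if i > 0:
        candidates.append(query - sorted_positions[i - 1])  # neighbor before query
    return min(candidates)
```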

read_in_new_universe_regions

read_in_new_universe_regions(db, q_chrom, current_chrom, unused_db, db_queue, waiting, pos_index)

Read in new universe regions closest to the peak

Parameters:

Name	Type	Description	Default
`db`	`file`	universe file	required
`q_chrom`	`str`	new peak's chromosome	required
`current_chrom`	`str`	chromosome that was analyzed so far	required
`unused_db`	`list`	list of positions from universe that were not compared to query	required
`db_queue`	`list`	queue of the last three positions in universe	required
`waiting`	`bool`	whether iterating through the file without calculating distance because the current chromosome is not present in the universe	required
`pos_index`	`list`	which indexes from universe region use to calculate distance	required

Returns:

Type	Description
	whether iterating through a chromosome not present in the universe; current chromosome in the query

calc_distance_between_two_files

calc_distance_between_two_files(universe, q_folder, q_file, flexible, save_each, folder_out, pref, uni_to_file=False)

Main function for calculating distance between regions in the query file and regions in the database

Parameters:

Name	Type	Description	Default
`universe`	`str`	path to universe	required
`q_folder`	`str`	path to folder containing query files	required
`q_file`	`str`	query file	required
`flexible`	`boolean`	whether the universe is flexible	required
`save_each`	`bool`	whether to save calculated distances for each file	required
`folder_out`	`str`	output folder	required
`pref`	`str`	prefix used as the name of the folder containing calculated distance for each file	required
`uni_to_file`		whether to calculate distance from universe to file	`False`

Returns:

Type	Description
	file name; median of distances of starts to starts in universe; median of distances of ends to ends in universe

run_distance

run_distance(folder, file_list, universe, no_workers, flexible=False, folder_out=None, pref=None, save_each=False, uni_to_file=False)

For group of files calculate distance to the nearest region in universe

Parameters:

Name	Type	Description	Default
`folder`	`str`	path to folder containing query files	required
`file_list`	`str`	path to file containing list of query files	required
`universe`	`str`	path to universe file	required
`no_workers`	`int`	number of parallel processes	required
`flexible`	`bool`	whether the universe is flexible	`False`
`folder_out`	`str`	output folder	`None`
`pref`	`str`	prefix used for saving	`None`
`save_each`	`bool`	whether to save calculated distances for each file	`False`
`uni_to_file`		whether to calculate distance from universe to file	`False`

Returns:

Type	Description
	mean of median distances from starts in query to the nearest starts in universe; mean of median distances from ends in query to the nearest ends in universe
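
The return value described above is a mean of per-file medians. The aggregation step alone can be sketched as follows, assuming the per-file distances are already available as lists. `mean_of_medians` is a hypothetical helper, not part of the module.

```python
from statistics import mean, median
from typing import Dict, List


def mean_of_medians(distances_per_file: Dict[str, List[float]]) -> float:
    """Take the median distance within each file, then average across files."""
    # Files with no distances are skipped; if none remain, mean() raises StatisticsError.
    medians = [median(d) for d in distances_per_file.values() if d]
    return mean(medians)
```

Using the median within each file keeps a few very distant regions from dominating that file's value before the files are averaged.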

intersection

Functions

chrom_cmp

chrom_cmp(a, b)

Return smaller chromosome name

relationship_helper

relationship_helper(region_a, region_b, only_in, overlap)

For two regions, calculate their overlap; for the earlier region, calculate how many base pairs are only in it

Parameters:

Name	Type	Description	Default
`region_a`		region that starts first	required
`region_b`		region that starts second	required
`only_in`	`int`	number of positions found only in region_a so far	required
`overlap`	`int`	number of overlapping positions so far	required

two_region_intersection_diff

two_region_intersection_diff(region_d, region_q, only_in_d, only_in_q, inside_d, inside_q, overlap, start_d, start_q, waiting_d, waiting_q)

Check mutual position of two regions and calculate intersection and difference of two regions

Parameters:

Name	Type	Description	Default
`region_d`	`list`	region from universe	required
`region_q`	`list`	region from query	required
`only_in_d`	`int`	number of base pairs only in universe	required
`only_in_q`	`int`	number of base pairs only in query	required
`inside_d`	`bool`	whether there is still part of the region from universe to analyse	required
`inside_q`	`bool`	whether there is still part of the region from query to analyse	required
`overlap`	`int`	size of overlap	required
`start_d`	`int`	start position of currently analyzed universe region	required
`start_q`	`int`	start position of currently analyzed query region	required
`waiting_d`	`bool`	whether waiting for the query to finish chromosome	required
`waiting_q`	`bool`	whether waiting for the universe to finish chromosome	required

read_in_new_line

read_in_new_line(region, start, chrom, inside, waiting, lines, c_chrom, not_e)

Read in a new line from query or universe file

calc_diff_intersection

calc_diff_intersection(db, folder, query)

Difference and overlap of two files on base pair level

Parameters:

Name	Type	Description	Default
`db`	`str`	path to universe file	required
`folder`	`str`	path to folder with query file	required
`query`	`str`	query file name	required

Returns:

Type	Description
	file name; bp only in universe; bp only in query; overlap in bp
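
The three base-pair counts in the return value can be illustrated with a two-pointer sweep over in-memory intervals. This sketch assumes a single chromosome, half-open `(start, end)` intervals, and that each list is sorted and has no internal overlaps; the real function reads files and tracks chromosomes. `bp_diff_intersection` is a hypothetical name.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def bp_diff_intersection(universe: List[Interval],
                         query: List[Interval]) -> Tuple[int, int, int]:
    """Return (bp only in universe, bp only in query, bp in both)."""
    overlap = 0
    i = j = 0
    while i < len(universe) and j < len(query):
        start = max(universe[i][0], query[j][0])
        end = min(universe[i][1], query[j][1])
        if start < end:
            overlap += end - start
        # Advance whichever interval ends first.
        if universe[i][1] <= query[j][1]:
            i += 1
        else:
            j += 1
    total_u = sum(e - s for s, e in universe)
    total_q = sum(e - s for s, e in query)
    return total_u - overlap, total_q - overlap, overlap
```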

run_intersection

run_intersection(folder, file_list, universe, no_workers)

Calculate the base pair intersection of universe and group of files

Parameters:

Name	Type	Description	Default
`folder`	`str`	path to folder containing query files	required
`file_list`	`str`	path to file containing list of query files	required
`universe`	`str`	path to universe file	required
`no_workers`	`int`	number of parallel processes	required
`save_to_file`	`str`	whether to save median of calculated distances for each file	required
`folder_out`	`str`	output folder	required
`pref`	`str`	prefix used for saving	required

Returns:

Type	Description
	mean of fractions of intersection of file and universe divided by universe size; mean of fractions of intersection of file and universe divided by file size
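
The two returned values are averages of per-file fractions. Assuming per-file base-pair counts in the order (only in universe, only in query, overlap), the averaging could be sketched as below. `mean_overlap_fractions` is a hypothetical helper, and the sketch does not guard against an empty universe or an empty file.

```python
from statistics import mean
from typing import Iterable, Tuple


def mean_overlap_fractions(counts: Iterable[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Average intersection/universe-size and intersection/file-size over files."""
    of_universe, of_file = [], []
    for only_u, only_q, overlap in counts:
        of_universe.append(overlap / (only_u + overlap))  # intersection / universe size
        of_file.append(overlap / (only_q + overlap))      # intersection / file size
    return mean(of_universe), mean(of_file)
```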

likelihood

Classes

LhModel

LhModel(model, cove)

Object with combined information about lh model and coverage

Parameters:

Name	Type	Description	Default
`model`	`ndarray`	lh model array	required
`cove`	`ndarray`	coverage array	required

Functions

calc_likelihood_hard

calc_likelihood_hard(universe, chroms, model_lh, coverage_folder, coverage_prefix, name, s_index, e_index=None)

Calculate likelihood of universe for a given type of model. To be used with the binomial model.

Parameters:

Name	Type	Description	Default
`universe`		path to universe file	required
`chroms`	`list`	list of chromosomes present in model	required
`model_lh`	`ModelLH`	likelihood model	required
`coverage_prefix`		prefix used in uniwig for creating coverage	required
`coverage_folder`		path to a folder with genome coverage by tracks	required
`name`	`str`	suffix of model file name, which contains information about model type	required
`s_index`	`int`	from which position in universe line take assess region start position	required
`e_index`	`int`	from which position in universe line take assess region end position	`None`

Returns:

Type	Description
	likelihood of universe for given model
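
This page does not spell out how the likelihood is computed. As an illustration only, suppose the model stores, for each coverage value, the log-probability that a position lies outside or inside the universe; the universe log-likelihood is then a sum over positions. The table layout, the half-open coordinates, and the name `universe_log_likelihood` are all assumptions of this sketch and may not match the module.

```python
from typing import List, Sequence, Tuple


def universe_log_likelihood(model: Sequence[Tuple[float, float]],
                            coverage: Sequence[int],
                            regions: List[Tuple[int, int]]) -> float:
    """Sum per-position log-probabilities under a coverage-indexed model.

    model[c] = (log P(outside | coverage c), log P(inside | coverage c))  # assumed layout
    """
    inside = [False] * len(coverage)
    for start, end in regions:
        for pos in range(start, min(end, len(coverage))):
            inside[pos] = True
    return sum(model[cov][1 if flag else 0] for cov, flag in zip(coverage, inside))
```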

hard_universe_likelihood

hard_universe_likelihood(model, universe, coverage_folder, coverage_prefix)

Calculate likelihood of hard universe based on core, start, end coverage model

Parameters:

Name	Type	Description	Default
`model`	`str`	path to file containing model	required
`universe`	`str`	path to universe	required
`coverage_prefix`		prefix used in uniwig for creating coverage	required
`coverage_folder`		path to a folder with genome coverage by tracks	required

Returns:

Type	Description
	likelihood

likelihood_only_core

likelihood_only_core(model_file, universe, coverage_folder, coverage_prefix)

Calculate likelihood of universe based only on core coverage model

Parameters:

Name	Type	Description	Default
`model_file`	`str`	path to file containing model	required
`universe`	`str`	path to universe	required
`coverage_prefix`		prefix used in uniwig for creating coverage	required
`coverage_folder`		path to a folder with genome coverage by tracks	required

Returns:

Type	Description
	likelihood

background_likelihood

background_likelihood(start, end, model_start, model_cove, model_end)

Calculate likelihood of background for given region

weigh_livelihood

weigh_livelihood(start, end, model_process, model_cove, model_out, reverse)

Calculate weighted likelihood of flexible part of the region

Parameters:

Name	Type	Description	Default
`start`	`int`	start of the region	required
`end`	`int`	end of the region	required
`model_process`	`array`	model for analyzed type of flexible region	required
`model_cove`	`array`	model for coverage	required
`model_out`	`array`	model for flexible region that is not being analyzed	required
`reverse`	`bool`	if model_process corresponds to the end, the weights have to be reversed	required

Returns:

Type	Description
	likelihood of flexible part of the region

likelihood_flexible_universe

likelihood_flexible_universe(model_file, universe, cove_folder, cove_prefix, save_peak_input=False)

Likelihood of given universe under the model

Parameters:

Name	Type	Description	Default
`model_file`	`str`	path to file with lh model	required
`universe`	`str`	path to universe	required
`cove_folder`		path to a folder with genome coverage by tracks	required
`cove_prefix`		prefix used in uniwig for creating coverage	required
`save_peak_input`	`bool`	whether to save universe with each peak lh	`False`

Returns:

Type	Description
	lh of the flexible universe

utils

Functions

prep_data

prep_data(folder, file, tmp_file)

File sort and merge
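
Sorting and merging regions can be sketched in memory as follows. The assumptions here are that regions are `(chrom, start, end)` tuples, chromosomes sort as plain strings, and touching regions are merged along with overlapping ones; the real function operates on files. `sort_and_merge` is a hypothetical name.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]


def sort_and_merge(regions: Iterable[Region]) -> List[Region]:
    """Sort regions, then merge those that overlap or touch on the same chromosome."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))  # extend previous region
        else:
            merged.append((chrom, start, end))
    return merged
```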

check_if_uni_sorted

check_if_uni_sorted(universe)

Check if regions in file are sorted
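
A sortedness check of this kind might look like the sketch below, which takes an iterable of BED lines rather than a path and compares `(chromosome, start)` pairs. Ordering chromosomes as plain strings is an assumption, and `bed_lines_sorted` is a hypothetical name.

```python
from typing import Iterable


def bed_lines_sorted(lines: Iterable[str]) -> bool:
    """Return True if regions appear in non-decreasing (chrom, start) order."""
    previous = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = (fields[0], int(fields[1]))
        if previous is not None and key < previous:
            return False
        previous = key
    return True
```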

process_line

process_line(line)

Helper for reading in bed file line

chrom_cmp_bigger

chrom_cmp_bigger(a, b)

Natural-order check of whether a chromosome name is bigger
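
Natural ordering treats digit runs as numbers, so `chr10` sorts after `chr2`. The sketch below shows one common way to do that; it is not taken from the module, and `natural_key` and `chrom_is_bigger` are hypothetical names.

```python
import re
from typing import List, Tuple, Union


def natural_key(name: str) -> List[Tuple[int, Union[int, str]]]:
    """Split a name into text and number chunks that compare safely."""
    key = []
    for part in re.split(r"(\d+)", name):
        if not part:
            continue
        # Tag each chunk so numbers and text are never compared directly.
        key.append((0, int(part)) if part.isdigit() else (1, part))
    return key


def chrom_is_bigger(a: str, b: str) -> bool:
    """True if chromosome name `a` sorts after `b` in natural order."""
    return natural_key(a) > natural_key(b)
```

In this sketch numeric chunks sort before text chunks at the same position, which is a design choice of the example rather than a documented behavior.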

process_db_line

process_db_line(dn, pos_index)

Helper for reading in universe bed file line
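
The `pos_index` argument appears throughout this page as the list of column indexes used to pick positions from a universe line. A minimal sketch of that idea, assuming whitespace-separated columns with the chromosome first, is shown below. The return shape and the name `parse_universe_line` are hypothetical.

```python
from typing import List, Sequence, Tuple


def parse_universe_line(line: str, pos_index: Sequence[int]) -> Tuple[str, List[int]]:
    """Return the chromosome and the integer positions at the requested columns."""
    fields = line.split()
    chrom = fields[0]
    positions = [int(fields[i]) for i in pos_index]
    return chrom, positions
```

For a line with several position columns, `pos_index` selects which of them are returned, for example `[1]` for the first position column only.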