Using bedshift to create an evaluation dataset for similarity scores

Generate different files

Bedshift perturbations include add, drop, shift, cut, and merge. Using any of these perturbations, or combinations of them, you can generate a set of files that are slightly perturbed from the original file. For example, if the original file is called original.bed and you want 100 files with added regions (rate 0.1) and 100 files with dropped regions (rate 0.3), run:

bedshift -l hg38.chrom.sizes -b original.bed -a 0.1 -r 100
bedshift -l hg38.chrom.sizes -b original.bed -d 0.3 -r 100

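Don't forget that the add and shift operations require a chrom.sizes file. By default, the output is written under the name bedshifted_original.bed. Because both commands above use that same default name, pass -o with a distinct output name to each command if you want to keep both sets of files.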
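When several perturbation settings are needed, it can help to build the command lines programmatically. The following is a minimal sketch, not part of bedshift itself: the helper name bedshift_command and the output names added.bed and dropped.bed are hypothetical, and only the flags already shown in this document (-l, -b, -a, -d, -r, -o) are used.

```python
import shlex


def bedshift_command(bed_file, chrom_sizes, output, repeat, add=0.0, drop=0.0):
    """Build one bedshift command line as a list of arguments."""
    cmd = ["bedshift", "-l", chrom_sizes, "-b", bed_file]
    if add > 0:
        cmd += ["-a", str(add)]
    if drop > 0:
        cmd += ["-d", str(drop)]
    # A distinct -o value per run keeps the outputs of different runs apart.
    cmd += ["-r", str(repeat), "-o", output]
    return cmd


# Hypothetical output names, chosen only for this illustration.
commands = [
    bedshift_command("original.bed", "hg38.chrom.sizes", "added.bed", 100, add=0.1),
    bedshift_command("original.bed", "hg38.chrom.sizes", "dropped.bed", 100, drop=0.3),
]

for cmd in commands:
    # Print the command; to execute it instead, use subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

The sketch only prints the commands, so it can be checked before anything is run.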

Evaluating a similarity score

This is where the bedshifted files are put to use. The 100 repetitions of add and drop are compared against the original file using the similarity score of your choice. The output of the similarity score should reflect the degree of change specified to bedshift. In very general terms, the procedure looks like this in pseudocode:

for each bedshift_file in folder:
    score = SimilarityScore(bedshift_file, original_file, ...)
    add score to score_list
avg_similarity_score = mean(score_list)

You can repeat this for each of the similarity scores and each of the perturbation combinations, and then compare the results. This shows whether, and how strongly, each similarity score responds to added regions, dropped regions, and other perturbations.
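The pseudocode above could be sketched in Python as follows. This is only an illustration under stated assumptions: the base-pair Jaccard index is used here as one example of a similarity score (it is not prescribed by bedshift), and the function names read_bed, merged_length, jaccard, and mean_similarity are hypothetical helpers written for this example.

```python
from statistics import mean


def read_bed(path):
    """Read (chrom, start, end) triples from a BED file, skipping header lines."""
    regions = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if (len(fields) < 3 or line.startswith("#")
                    or fields[0] in ("track", "browser")):
                continue
            regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def merged_length(regions):
    """Total number of bases covered by the regions, counting overlaps once."""
    total = 0
    current = None  # (chrom, start, end) of the interval being extended
    for chrom, start, end in sorted(regions):
        if current and chrom == current[0] and start <= current[2]:
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (chrom, start, end)
    if current:
        total += current[2] - current[1]
    return total


def jaccard(regions_a, regions_b):
    """Base-pair Jaccard index: covered intersection divided by covered union."""
    union = merged_length(regions_a + regions_b)
    if union == 0:
        return 0.0
    intersection = merged_length(regions_a) + merged_length(regions_b) - union
    return intersection / union


def mean_similarity(original_file, perturbed_files, score=jaccard):
    """Average the chosen score over all perturbed files versus the original."""
    original = read_bed(original_file)
    return mean([score(read_bed(path), original) for path in perturbed_files])


# Example call, assuming the perturbed files were collected in a folder
# named "bedshift_output" (a hypothetical name):
#   from pathlib import Path
#   files = sorted(Path("bedshift_output").glob("*.bed"))
#   print(mean_similarity("original.bed", files))
```

Any other score with the same two-argument signature can be passed through the score parameter, which makes it straightforward to compare several scores on the same set of perturbed files.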

Using a PEP to quickly submit multiple bedshift jobs

Using a Portable Encapsulated Project (PEP), creating multiple combinations of bedshift files becomes faster and more organized. The PEP consists of a config file and a sample table that holds the perturbation parameters. Here is what sample_table.csv may look like, displayed as a table. Each row specifies the arguments for one bedshift command.

sample_name	add	drop	shift	cut	merge
add1	0.1	0.0	0.0	0.0	0.0
add2	0.2	0.0	0.0	0.0	0.0
add3	0.3	0.0	0.0	0.0	0.0
drop-shift1	0.0	0.1	0.2	0.0	0.0
drop-shift2	0.0	0.2	0.2	0.0	0.0
drop-cut	0.0	0.3	0.0	0.4	0.0
shift-merge	0.0	0.0	0.4	0.0	0.4

And here is what the project_config.yaml file looks like:

pep_version: 2.0.0
sample_table: "sample_table.csv"
sample_modifiers:
  append:
    file: "original.bed"
    repeat: 100

Now the project is described neatly in two files. The sample_modifiers section in the config file adds extra columns to the sample table in post-processing. This makes the project more configurable, because a parameter shared by all samples does not have to be repeated in every row of sample_table.csv. In this example, sample_modifiers appends two columns: the file on which bedshift is to be performed, and the number of repetitions that bedshift should create.
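To make the effect of the append modifier concrete, here is a small Python sketch that mimics it on two rows of the sample table. It is an illustration of the idea only, not how the PEP tooling is implemented; the helper name apply_append is hypothetical, and the table is assumed to be comma-separated on disk.

```python
import csv
import io

# Two rows from the sample table, written as comma-separated text (assumed format).
SAMPLE_TABLE = """sample_name,add,drop,shift,cut,merge
add1,0.1,0.0,0.0,0.0,0.0
drop-cut,0.0,0.3,0.0,0.4,0.0
"""

# The constant attributes listed under sample_modifiers: append.
APPEND = {"file": "original.bed", "repeat": 100}


def apply_append(rows, constants):
    """Return new rows in which every constant attribute is an extra column."""
    return [{**row, **constants} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_TABLE)))
samples = apply_append(rows, APPEND)

for sample in samples:
    print(sample["sample_name"], sample["file"], sample["repeat"])
```

Every sample ends up with the same file and repeat values, while the table on disk stays limited to the parameters that actually vary between samples.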

The PEP describes the project, but the tool that submits the project jobs is called looper. With a single command, it interprets the PEP and forms the commands to be run locally or submitted to your computing cluster. To use looper, you will need to add a few lines to your project_config.yaml:

pep_version: 2.0.0
sample_table: "sample_table.csv"
looper:
  output_dir: "looper_output/"
sample_modifiers:
  append:
    pipeline_interfaces: "pipeline_interface.yaml"
    file: "original.bed"
    repeat: 100

You will also need to create a pipeline_interface.yaml that describes how to form commands:

pipeline_name: bedshift_run
pipeline_type: sample
command_template: >
    bedshift -b {sample.file} -l hg38.chrom.sizes -a {sample.add} -d {sample.drop} -s {sample.shift} -c {sample.cut} -m {sample.merge} -r {sample.repeat} -o {sample.sample_name}.bed
compute:
  mem: 4000
  cores: 1
  time: "00:10:00"
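
To preview what the command template above produces for a given row, the substitution can be imitated in Python. This is a rough sketch under stated assumptions: looper uses its own template rendering, and str.format is only a stand-in here; the helper name render_command is hypothetical, and the row combines the drop-cut sample with the appended file and repeat attributes.

```python
from types import SimpleNamespace

# The command template from pipeline_interface.yaml, joined into one string.
COMMAND_TEMPLATE = (
    "bedshift -b {sample.file} -l hg38.chrom.sizes "
    "-a {sample.add} -d {sample.drop} -s {sample.shift} "
    "-c {sample.cut} -m {sample.merge} "
    "-r {sample.repeat} -o {sample.sample_name}.bed"
)


def render_command(template, attributes):
    """Fill the {sample.<attribute>} placeholders from one sample's attributes."""
    sample = SimpleNamespace(**attributes)
    return template.format(sample=sample)


# One row of the sample table plus the appended attributes.
row = {
    "sample_name": "drop-cut",
    "add": 0.0, "drop": 0.3, "shift": 0.0, "cut": 0.4, "merge": 0.0,
    "file": "original.bed", "repeat": 100,
}

command = render_command(COMMAND_TEMPLATE, row)
print(command)
```

Rendering one row by hand like this is a quick way to confirm that every column name used in the template exists in the sample table or in the appended attributes.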

After all of this, the command to run looper and submit the jobs is:

looper run project_config.yaml

Soon, you should see bedshift files appear in the looper_output folder. The BED file names will correspond to the sample names from sample_table.csv.