Randomizing BED files with bedshift

Bedshift is a tool for randomly perturbing BED file regions. The perturbations supported on regions are shift, drop, add, cut, and merge. This tool is particularly useful for creating test datasets for various tasks, since there often is no ground truth dataset to compare to. By perturbing a file, a pipeline or analysis can be run on both the perturbed file and the original file, then be compared.

Installing

Bedshift is part of the geniml package distributed on PyPI.

pip install geniml

Quickstart

The package is available for use both as a command line interface and a python package. To get started, type on the command line:

geniml bedshift -h

The following examples will shift 10% of the regions and add 10% new regions in examples/test.bed. The -l argument is the file in which chromosome sizes are located, and is only required for adding and/or shifting regions. The output is located at bedshifted_test.bed.

CLI:

geniml bedshift -l hg38.chrom.sizes -b tests/test.bed -s 0.1 -a 0.1

Python:

from geniml.bedshift import bedshift

bedshifter = bedshift.Bedshift('tests/test.bed', 'hg38.chrom.sizes')
bedshifter.shift(shiftrate=0.1, shiftmean=0.0, shiftstdev=120.0)
bedshifter.add(addrate=0.1, addmean=320.0, addstdev=20.0)
bedshifter.to_bed('tests/test_output.bed')

Example Repository

If you're looking to use Bedshift in your own experiment, we created an example repository containing working code to:

Produce a large dataset of Bedshift files
Run a pipeline on the dataset and obtain results
Aggregate and visualize the results

It integrates the PEP and looper workflow allowing you to easily run the project out of the box.

Generate a random BED file

To generate a random BED file, you need to start with is a BED file with one region, which you will delete later. (It will appear at the top of the BED files, so it will be easier to delete later.) On the command line:

echo -e "chr1\t1\t1000" > random.bed

The next step is to construct a bedshift command to add regions to this file. We will need to specify the rate of add, which in this case is going to be the number of the new regions we want. Let's try generating 1,000 new regions. Don't forget to specify a chromosome sizes file, which is required for adding regions.

geniml bedshift -b random.bed -l hg38.chrom.sizes -a 1000

The printed output should say that 1000 regions changed. Finally, go into the output at bedshifted_random.bed and delete the original region. If you specified repeats and have many output files that need to have the original region deleted, here is a handy command to delete the first line of every BED file. (Warning: make sure there are no other BED files in the folder before using this command.)

find . -name "*.bed" -exec sed -i.bak '1d' {} \;
# if your system is macOS
# find . -name "*.bed" -exec sed -i .bak '1d' {} \;

This find command will find all BED files and execute a sed command to remove the first line. The sed command will operate in place and create .bak backup files, which can be removed later.

Shift, Add, and Drop From File

"From file" means that the regions selected to shift, add, or drop are specified from a provided file. These features provide the ability to finely control what regions are perturbed. For example, if you have a BED file specifying exon regions and you want to add only exons, you can use --addfile.

Add from file example

geniml bedshift -b mydata.bed -a 0.07 --addfile exons.bed

Specifying --addfile with -a add rate will increase the size of mydata.bed by 7% with new regions selected from exons.bed.

Shift from file example

Shift from file first calculates which regions overlap between the specified --shiftfile and --bedfile, then selects which regions to shift among those overlaps.

geniml bedshift -b mydata.bed -l hg38.chrom.sizes -s 0.42 --shiftmean 5 --shiftstdev 5 --shiftfile snp.bed

In this example, we only want to shift regions that are SNPs. The number of shifted regions is 42% of the total regions in mydata.bed. Notice here that unlike --addfile, we still have to specify the shift mean and standard deviation. This is because --shiftfile tells which regions to shift, but not by how much.

Drop from file example

Drop from file, like shift from file, calculates overlaps between the specified --dropfile and --bedfile, then selects regions from those overlaps to drop.

geniml bedshift -b mydata.bed -d 0.4 --dropfile snp.bed

This command will drop regions that overlap with SNPs. The number of dropped regions is 40% of the total regions in mydata.bed.

Use a YAML File to Specify Perturbations

Sometimes the default settings of bedshift does not allow enough control over perturbations. For example, the order of perturbations is fixed as shift, add, cut, merge, drop, so if you wanted to change the order you would have to specify multiple commands. The same problem arises when you want to run multiple "add from file" commands - there is just no way to do it using a single command.

This is why we created the YAML config file perturbation option. In the YAML file, users can specify as many perturbations as they want, along with the parameters specific to each perturbation. An example of a YAML config file follows:

bedshift_operations:
  - add_from_file:
    file: exons.bed
    rate: 0.2
  - add_from_file:
    file: snp.bed
    rate: 0.05
  - shift_from_file:
    file: exons.bed
    rate: 0.4
    mean: 100
    stdev: 85
  - shift_from_file:
    file: snp.bed
    rate: 0.4
    mean: 2
    stdev: 1
  - merge:
    rate: 0.15

The order of perturbations is run in the same order they are specified. So in this example, we add from two different files, then also shift those regions that were just added. Finally we perform a merge at 15% rate.

How to bedshift all files in a directory

Using shell

Assuming you are bedshifting all files in the current working directory using the same parameter, use the following shell script (changing parameters as needed), which iterates over files in the directory and applies bedshift:

#!/bin/bash
for filename in *.bed; do
    CHROM_LENGTHS=hg38.chrom.sizes
    BEDFILE=$filename
    DROP_RATE=0.3

    ADD_RATE=0.2
    ADD_MEAN=320.0
    ADD_STDEV=30.0

    SHIFT_RATE=0.2
    SHIFT_MEAN=0.0
    SHIFT_STDEV=150.0

    CUT_RATE=0.0
    MERGE_RATE=0.0

    geniml bedshift --bedfile $BEDFILE --chrom-lengths $CHROM_LENGTHS --droprate $DROP_RATE --addrate $ADD_RATE --addmean $ADD_MEAN --addstdev $ADD_STDEV --shiftrate $SHIFT_RATE --shiftmean $SHIFT_MEAN --shiftstdev $SHIFT_STDEV --cutrate $CUT_RATE --mergerate $MERGE_RATE
done

Using Python

In Python, you need the os library to get the filenames in a directory. Then you loop through the filenames and apply bedshift.

from geniml.bedshift import bedshift
import os

files = os.listdir('/path/to/data/')
for file in files:
    if file.endswith('.bed'):
        # you may also pass in a chrom.sizes file as the 
        # second argument if you are adding or shifting regions
        b = bedshift.Bedshift(file)
        b.all_perturbations(cutrate=0.3, droprate=0.2)
        b.to_bed('bedshifted_' + file)

Add Random Regions Only in Valid Regions

Using the basic --add option, regions are added randomly onto any chromosome at any location, without any regard for non-coding regions. For use cases of Bedshift more rooted in biology, this effect is not desirable. The --add-valid option gives the user the ability to specify a BED file indicating areas where it is valid to add regions. Thus, if an --add-valid file has only coding regions, then regions will be randomly added only in those areas. Here is an example:

geniml bedshift -b mydata.bed -a 0.5 --add-valid coding.bed --addmean 500 --addstdev 200

coding.bed contains large regions of the genome which are coding. Added regions can be anywhere inside of those regions. In addition, the method considers the size of the valid regions in deciding where the new regions will be added, so the smaller valid regions will contain proportionally less new regions than the larger valid regions.