Software

All custom software used to build network models, ChIP-Seq and other related analysis are freely available. Software page provides links to GitHub repositories that contains source files for these software

Available Software

DuffNGS: alignment and peak calling of ChIP-seq data
DuffyTools: Custom R package for ChIP-seq peak filtering
ChIPSeq Consolidation Processing: ChIP-seq processing code used in Minch et al. 2014
ChromeGoose: A Gaggle plugin for Chrome
cMonkey: cMonkey integrated biclustering algorithm
cMonkeyNwInf: Network Inference for cMonkey output based on Inferelator and elastic-net

Manual for DuffyNGS Package

pipe.ChIPpeaks

package:DuffyNGS Turn Alignments into Wiggle Tracks and ChIP Peaks.

Description:

Pipeline step that generates files of ChIP peaks for one sample. The alignment BAM files are converted to strand specific and unique/multi-hit wiggle tracks, and then the read pileup depth is interogated for the presence of ChIP peaks.

Usage:


     pipe.ChIPpeaks(sampleID, annotationFile = "Annotation.txt", optionsFile = "Options.txt",
 
             speciesID = NULL, results.path = NULL, loadWIG = FALSE, storage.mode = "normal",
 
             doPeakSearch = TRUE, peak.type = "auto", cutoff.medians = 3, canonical.width = 50,
 
             use.multicore = TRUE, p.value = 0.25, verbose = FALSE, 

             watermark.gene = sub("_[BS][0-9]+$", "", sampleID), 

             controlPeaksFile = NULL, scaledReadCount = NULL, visualize = interactive())

Arguments:

sampleID: The SampleID for this sample. This SampleID keys for one row of annotation details in the annotation file, for getting sample-specific details. The SampleID is also used as a sample-specific prefix for all files created during the processing of this sample.

annotationFile: File of sample annotation details, which specifies all needed sample-specific information about the samples under study. See `DuffyNGS_Annotation`.

optionsFile: File of processing options, which specifies all processing parameters that are not sample specific. See `DuffyNGS_Options`.

speciesID: The SpeciesID of the target species to calculate a transcriptome for. By default, transcriptome for all species in the current target are generated.

results.path: The top level folder path for writing result files to. By default, read from the Options file entry 'results.path'.

loadWIG: Logical, should all wiggle track data structures be rebuilt from the alignment BAM files. By default, the wiggle files are only created if they do not yet exist for this sample.

storage.mode: Controls the behavior of how alignments are loaded into the the wiggle track data objects.

doPeakSearch: Logical, controls whether the full peak search algorithm is run, or to instead just rerun the post-search summarization and plotting steps of this pipeline.

peak.type: Controls the type of peak shapes the picker evaluates against the raw pileup data. Choices are 'gaussian', 'gumbel', 'lorentz', and 'auto'. The default 'auto' lets the peak picker use all known peak shapes and select the one that best fits the raw data at each location.

cutoff.medians: Defines the cutoff threshold between noise and potential true peaks, as a multiple of the strand's median raw read pileup depth. Only locations having higher read depth than the cutoff threshold will be passed to the lower level peak picking algorithm. Setting the cutoff too low will increase computation time and introduce false peaks, while setting it too high will cause some true but shorter peaks to be missed.

canonical.width: Used to define the canonical 'half-width at half-height' (HWHH) of an expected ChIP peak in the sample. As the ChIP peaks themselves are a result of read pileups at various locations along the genome, the best starting estimate for the canonical width will be exactly the length in nucleotides of the individual raw reads in the FASTQ data.

use.multicore: Logical, should the three strands of raw read pileup data (forward, reverse, and combined) be run in parallel as three separate child peak pick processes.

p.value: The limiting P-value for determining how many peaks to include in the final tables of ChIP peak results. Peaks with poor scoring final P-values are dropped.

verbose: Logical, send progress information to standard out.

watermark.gene: The GeneID of the induced gene for this sample. In some experiments, we expect an overabundance of read pileup data at the induced gene. The algorithm does a extra step of median subtraction inside the gene boundary prior to peak search, to help deconvolute the induction artifact from any true ChIP peak in the area of the gene. Set to 'NULL' to disable this behavior.

controlPeaksFile: Character filename of a set of negative control peaks from samples that should not have real ChIP peaks. This set of 'non-peaks' are used for calculating the P-values for peaks in the given sample.

scaledReadCount: Total raw read count that all samples will be scaled to prior to peak picking, to assure that all samples will be scored in an equivalent manner. Many of the peak shape scoring metrics are influenced by read depth, so this helps give uniform peak scoring between samples.

visualize: Logical, should the pipeline generate plot images of peak evaluation and final peak calls during the peak search run. Plotting can only occur in an interactive session. Assumes the program is running on a machine that supports X11 graphics.

Details:

This is the high level pipeline step for finding ChIP peaks in a sample. If needed, it first turns the BAM alignments into wiggle track data; then calls the peak peaker, and finally generates several files of ChIP peak results.

The algorithm was written and tuned for the transcription factor induction data from Minch & Rustad (DOI: 10.1038/ncomms6829) of M.tuberculosis. It may need some retuning for other genomes and/or sequencing data variabilities.

Value:

A family of files are written to this sample's subfolder in the 'ChIPpeaks' folder of results. These include:

Strand specific peak files: Three text files of single strand peak calls, for the forward, reverse, and combined wiggle tracks.

Strand specific ROC curve plots: If interactive graphics were generated, ROC curves of the peak pick score distributions are made.

Final peaks text file: One text file of final peak calls, after combining the 3 strand-specific results to find peak triplets that satisfy the properties of a true ChIP peak.

Final peaks .csv file: One Excel readable file of final peak calls, after appending a column of the genomic DNA sequence under the peak. This column is configured as FASTA sequences that can be submitted to a motif detection algorithm to find the binding motif of that sample's transcription factor.

Final peaks .html file: One web browser readable file of final peak calls, with hyperlinks into a subfolder of plot images of all final ChIP peaks.

Peaks scoring details: One text file of all scoring metric details for all identified peaks. Useful for tuning the algorithm or just understanding why peaks got scored as they did.

Author(s):

Bob Morrison

References:

The DNA-binding network of Mycobacterium tuberculosis KJ Minch, TR Rustad, et.al. Nature Communications 6, Article number: 5829, doi:10.1038/ncomms6829

ChIP-seq Processing Documentation:

2014 Shuyi Ma

Used in Minch et al., (2014) The DNA binding network of Mycobacterium tuberculosis. In Press at Nature Communications.

This pipeline consolidates the individual peak information from the .csv files generated from analyzing the individual ChIP-seq experiments using Bob Morrison's pipeline. Peaks from replicate experiments are consolidated in this pipeline, and significant peaks from all experiments are extracted and concatenated.

The process is as follows:

1. Each experiment will be labeled as RvXXXX_BXXX, and the data are contained in a separate folder named for eaxh experiment. The .csv file is labeled as: 'RvXXXX_BXXX.MTb.ChIPpeaks.Excel.csv'. The user must first compile a .txt file of a list of all of the individual experiment labels to be combined (i.e. no control peaks, and only experiments of the correct experimental conditions should be included in this file).

Run samplefilesparsing.py and enter the .txt file containing the experiment labels.

This code will generate separate samplefilesX.txt files that contains the label names of each TF that has X number of replicates. TFs with multiple replicate experiments will be treated differently from TFs with only 1 replicate.

2. To process TFs with single replicates, run:
ConsensusBindingNetworkExtractor-psingle.py
This code will expect a file named 'samplefiles1.txt', containing all of the experiment labels for TFs with one replicate. The code will ask for the p-value threshold (e.g. 0.05)

The code will output 2 files:
MTB_ChIPbindingnetworkinfo1.txt
- contains the information for all significant peaks of the TFs with one replicate
MTBChIPpeakoverlay1.txt
- a vector of length of the size of the MTB genome, with each element denoting the number of peak footprints that fall into each base location.
3. For each of the remaining samplefilesX.txt files, run:
ConsensusBindingNetworkExtractor-pmulti.py
This code has the following inputs:
- - Will need to input the number of replicates described in samplefilesX.txt (i.e. what is X?)
- - Will need the name of the most recent MTBChIPpeakoverlayX.txt file that was generated
- - Will need the p-value threshold
This code will output 2 files:
MTB_ChIPbindingnetworkinfoX.txt
and MTBChIPpeakoverlayX.txt
which are analogous to the output files generated from the -single code.
4. Each of the MTB_ChIPbindingnetworkinfoX.txt need to be concatenated. This can be done manually or through a script. One possible linux script would be:
cat "MTB_ChIPbindingnetworkinfo*" > 'MTBchipnetwork.txt'
5. (optional) To count the number of TFs that have peaks overlapping at a particular location, run:
ComparingBindingconsensus2.py
This code will ask for a .txt file of the consolidated network generated from Step 4.

This code outputs a file with peak overlap counts for each peak in the input file.

Help for Software Page