Help for Software Page
All custom software used to build network models, ChIP-Seq and other related analysis are freely available. Software page provides links to GitHub repositories that contains source files for these software
- DuffNGS: alignment and peak calling of ChIP-seq data
- DuffyTools: Custom R package for ChIP-seq peak filtering
- ChIPSeq Consolidation Processing: ChIP-seq processing code used in Minch et al. 2014
- ChromeGoose: A Gaggle plugin for Chrome
- cMonkey: cMonkey integrated biclustering algorithm
- cMonkeyNwInf: Network Inference for cMonkey output based on Inferelator and elastic-net
Manual for DuffyNGS Package
package:DuffyNGS Turn Alignments into Wiggle Tracks and ChIP Peaks.
Pipeline step that generates files of ChIP peaks for one sample. The alignment BAM files are converted to strand specific and unique/multi-hit wiggle tracks, and then the read pileup depth is interogated for the presence of ChIP peaks.
pipe.ChIPpeaks(sampleID, annotationFile = "Annotation.txt", optionsFile = "Options.txt",
speciesID = NULL, results.path = NULL, loadWIG = FALSE, storage.mode = "normal",
doPeakSearch = TRUE, peak.type = "auto", cutoff.medians = 3, canonical.width = 50,
use.multicore = TRUE, p.value = 0.25, verbose = FALSE,
watermark.gene = sub("_[BS][0-9]+$", "", sampleID),
controlPeaksFile = NULL, scaledReadCount = NULL, visualize = interactive())
sampleID: The SampleID for this sample. This SampleID keys for one row
of annotation details in the annotation file, for getting
sample-specific details. The SampleID is also used as a
sample-specific prefix for all files created during the
processing of this sample.
annotationFile: File of sample annotation details, which specifies all
needed sample-specific information about the samples under
study. See `DuffyNGS_Annotation`.
optionsFile: File of processing options, which specifies all processing
parameters that are not sample specific. See
speciesID: The SpeciesID of the target species to calculate a
transcriptome for. By default, transcriptome for all species
in the current target are generated.
results.path: The top level folder path for writing result files to.
By default, read from the Options file entry 'results.path'.
loadWIG: Logical, should all wiggle track data structures be rebuilt
from the alignment BAM files. By default, the wiggle files
are only created if they do not yet exist for this sample.
storage.mode: Controls the behavior of how alignments are loaded into
the the wiggle track data objects.
doPeakSearch: Logical, controls whether the full peak search algorithm
is run, or to instead just rerun the post-search
summarization and plotting steps of this pipeline.
peak.type: Controls the type of peak shapes the picker evaluates
against the raw pileup data. Choices are 'gaussian',
'gumbel', 'lorentz', and 'auto'. The default 'auto' lets the
peak picker use all known peak shapes and select the one that
best fits the raw data at each location.
cutoff.medians: Defines the cutoff threshold between noise and
potential true peaks, as a multiple of the strand's median
raw read pileup depth. Only locations having higher read
depth than the cutoff threshold will be passed to the lower
level peak picking algorithm. Setting the cutoff too low
will increase computation time and introduce false peaks, while
setting it too high will cause some true but shorter peaks to
canonical.width: Used to define the canonical 'half-width at
half-height' (HWHH) of an expected ChIP peak in the sample.
As the ChIP peaks themselves are a result of read pileups at
various locations along the genome, the best starting
estimate for the canonical width will be exactly the length
in nucleotides of the individual raw reads in the FASTQ data.
use.multicore: Logical, should the three strands of raw read pileup
data (forward, reverse, and combined) be run in parallel as
three separate child peak pick processes.
p.value: The limiting P-value for determining how many peaks to
include in the final tables of ChIP peak results. Peaks with
poor scoring final P-values are dropped.
verbose: Logical, send progress information to standard out.
watermark.gene: The GeneID of the induced gene for this sample. In
some experiments, we expect an overabundance of read pileup
data at the induced gene. The algorithm does a extra step of
median subtraction inside the gene boundary prior to peak
search, to help deconvolute the induction artifact from any
true ChIP peak in the area of the gene. Set to 'NULL' to
disable this behavior.
controlPeaksFile: Character filename of a set of negative control peaks
from samples that should not have real ChIP peaks. This set
of 'non-peaks' are used for calculating the P-values for
peaks in the given sample.
scaledReadCount: Total raw read count that all samples will be scaled
to prior to peak picking, to assure that all samples will be
scored in an equivalent manner. Many of the peak shape
scoring metrics are influenced by read depth, so this helps
give uniform peak scoring between samples.
visualize: Logical, should the pipeline generate plot images of peak
evaluation and final peak calls during the peak search run.
Plotting can only occur in an interactive session. Assumes
the program is running on a machine that supports X11
This is the high level pipeline step for finding ChIP peaks in a sample. If needed, it first turns the BAM alignments into wiggle track data; then calls the peak peaker, and finally generates several files of ChIP peak results.
The algorithm was written and tuned for the transcription factor induction data from Minch & Rustad (DOI: 10.1038/ncomms6829) of M.tuberculosis. It may need some retuning for other genomes and/or sequencing data variabilities.
Value:A family of files are written to this sample's subfolder in the 'ChIPpeaks' folder of results. These include:
Strand specific peak files: Three text files of single strand peak
calls, for the forward, reverse, and combined wiggle tracks.
Strand specific ROC curve plots: If interactive graphics were
generated, ROC curves of the peak pick score distributions
Final peaks text file: One text file of final peak calls, after
combining the 3 strand-specific results to find peak triplets
that satisfy the properties of a true ChIP peak.
Final peaks .csv file: One Excel readable file of final peak calls,
after appending a column of the genomic DNA sequence under
the peak. This column is configured as FASTA sequences that
can be submitted to a motif detection algorithm to find the
binding motif of that sample's transcription factor.
Final peaks .html file: One web browser readable file of final peak
calls, with hyperlinks into a subfolder of plot images of all
final ChIP peaks.
Peaks scoring details: One text file of all scoring metric details for
all identified peaks. Useful for tuning the algorithm or
just understanding why peaks got scored as they did.
References:The DNA-binding network of Mycobacterium tuberculosis KJ Minch, TR Rustad, et.al. Nature Communications 6, Article number: 5829, doi:10.1038/ncomms6829
See Also:`calcWigChIPpeaks` for the intermediate level implementation, and `findPeaks` for the low level single strand read pileup depth peak pick tool.
ChIP-seq Processing Documentation:
2014 Shuyi Ma
Used in Minch et al., (2014) The DNA binding network of Mycobacterium tuberculosis. In Press at Nature Communications.
This pipeline consolidates the individual peak information from the .csv files generated from analyzing the individual ChIP-seq experiments using Bob Morrison's pipeline. Peaks from replicate experiments are consolidated in this pipeline, and significant peaks from all experiments are extracted and concatenated.
The process is as follows:
2. To process TFs with single replicates, run:
This code will expect a file named 'samplefiles1.txt', containing all of the experiment labels for TFs with one replicate. The code will ask for the p-value threshold (e.g. 0.05)
The code will output 2 files:MTB_ChIPbindingnetworkinfo1.txt - contains the information for all significant peaks of the TFs with one replicate MTBChIPpeakoverlay1.txt - a vector of length of the size of the MTB genome, with each element denoting the number of peak footprints that fall into each base location.
3. For each of the remaining samplefilesX.txt files, run:
This code has the following inputs:
- - Will need to input the number of replicates described in samplefilesX.txt (i.e. what is X?)
- - Will need the name of the most recent MTBChIPpeakoverlayX.txt file that was generated
- - Will need the p-value threshold
This code will output 2 files:MTB_ChIPbindingnetworkinfoX.txt and MTBChIPpeakoverlayX.txt which are analogous to the output files generated from the -single code.
4. Each of the MTB_ChIPbindingnetworkinfoX.txt need to be concatenated. This can be done manually or through a script. One possible linux script would be:
cat "MTB_ChIPbindingnetworkinfo*" > 'MTBchipnetwork.txt'
5. (optional) To count the number of TFs that have peaks overlapping at a particular location, run:
This code will ask for a .txt file of the consolidated network generated from Step 4.
This code outputs a file with peak overlap counts for each peak in the input file.
1. Each experiment will be labeled as RvXXXX_BXXX, and the data are contained in a separate folder named for eaxh experiment. The .csv file is labeled as: 'RvXXXX_BXXX.MTb.ChIPpeaks.Excel.csv'. The user must first compile a .txt file of a list of all of the individual experiment labels to be combined (i.e. no control peaks, and only experiments of the correct experimental conditions should be included in this file).
Run samplefilesparsing.py and enter the .txt file containing the experiment labels.
This code will generate separate samplefilesX.txt files that contains the label names of each TF that has X number of replicates. TFs with multiple replicate experiments will be treated differently from TFs with only 1 replicate.