14 May 2013:
Nesoni is free software, released under the GPL (version 2).
Nesoni is a high-throughput sequencing data analysis toolset, which the VBC has developed to cope with the flood of Illumina, 454, and SOLiD data now being produced.
Our work is largely with bacterial genomes, and the design tradeoffs in nesoni reflect this.
Alignment to reference
Nesoni focusses on analysing the alignment of reads to a reference genome. We use the SHRiMP read aligner, as it is able to detect small insertions and deletions in addition to SNPs.
Nesoni can call a consensus of read alignments, taking care to indicate ambiguity. This can then be used in various ways: to determine the protein level changes resulting from SNPs and indels, to find differences between multiple strains, or to produce n-way comparison data suitable for phylogenetic analysis in SplitsTree4.
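The idea of calling a consensus while indicating ambiguity can be sketched roughly as follows. This is not Nesoni's actual calling procedure: the function name `call_consensus` and the thresholds `min_depth` and `min_fraction` are assumptions made purely for illustration. The sketch reports an IUPAC ambiguity code when no single base explains enough of the reads, and `N` when there is too little data.

```python
# Hypothetical sketch, not Nesoni's algorithm: call a consensus base from
# per-position base counts, using IUPAC codes to indicate ambiguity.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_consensus(counts, min_depth=5, min_fraction=0.8):
    """counts maps each of 'A', 'C', 'G', 'T' to the number of reads showing it.

    min_depth and min_fraction are illustrative thresholds (assumptions).
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too few reads to say anything
    chosen, covered = set(), 0
    # Add bases in order of decreasing count until they explain enough reads.
    for base, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        chosen.add(base)
        covered += n
        if covered >= min_fraction * depth:
            break
    return IUPAC[frozenset(chosen)]
```

For example, a position covered by ten reads showing A and ten showing G would be reported as `R` rather than as either base.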
Alternatively, the raw counts of bases at each position in the reference, as seen in two different sequenced strains, can be compared using Fisher's Exact Test.
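To make the comparison concrete, here is a minimal sketch of a two-sided Fisher's Exact Test on a 2x2 table. It assumes the counts at a site have been reduced to "reads showing the reference base" versus "reads showing anything else" for each strain; that reduction, and the name `fisher_exact_2x2`, are assumptions for illustration and not necessarily how "nesoni fisher:" tabulates its data.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's Exact Test p-value for the table [[a, b], [c, d]].

    Illustrative layout (an assumption): rows are strains, columns are
    (reads with reference base, reads with any other base).
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    # The small tolerance guards against floating point ties.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= observed * (1 + 1e-9)
    )
```

A site where strain 1 shows only the reference base and strain 2 shows only a different base gives a very small p-value, while identical count profiles give a p-value of 1.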
Nesoni includes tools that make it easy to write parallel processing pipelines in Python.
Pipelines are expressed as Python functions. The translation of a serial program with for-loops and function calls into a parallel program requires only simple localized modifications to the code.
Pipelines expressed in this way are composable, just like ordinary functions.
Much like make, the resulting program re-runs only the tools affected when parameters change. The dependency structure is implicit in the parallel program: if a tool needs to be re-run, only the steps that must execute after that tool also need re-running.
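The general idea can be illustrated with a small, self-contained Python 3 sketch. It does not use Nesoni's own pipeline API. The names `run_step`, `clip`, `align`, `analyse_sample`, `analyse_samples` and `run_log` are all hypothetical, and an in-memory dictionary stands in for whatever record of completed work a real tool would keep on disk.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor

_done = {}    # stand-in for an on-disk record of completed steps (assumption)
run_log = []  # names of steps that were actually executed, for inspection


def run_step(name, func, **params):
    """Run func(**params) unless it has already run with identical parameters."""
    key = hashlib.sha1(
        json.dumps([name, params], sort_keys=True).encode()
    ).hexdigest()
    if key not in _done:
        run_log.append(name)
        _done[key] = func(**params)
    return _done[key]


# Two toy "tools". Real tools would read and write files.
def clip(sample, quality):
    return f"{sample}.clipped(q={quality})"


def align(reads, reference):
    return f"{reads}.aligned({reference})"


def analyse_sample(sample, quality, reference):
    # A pipeline is an ordinary function, so pipelines compose like functions.
    clipped = run_step("clip", clip, sample=sample, quality=quality)
    return run_step("align", align, reads=clipped, reference=reference)


def analyse_samples(samples, quality, reference):
    # Serial version: [analyse_sample(s, quality, reference) for s in samples]
    # Parallel version: the same loop, handed to a pool of workers.
    with ThreadPoolExecutor() as pool:
        return list(
            pool.map(lambda s: analyse_sample(s, quality, reference), samples)
        )
```

Calling `analyse_samples(["a", "b"], quality=20, reference="ref1")` twice runs each step only once. Changing `reference` afterwards re-runs only the `align` steps, because the inputs to `clip` are unchanged; changing `quality` would re-run both, since the output of `clip` feeds `align`.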
k-mer and De Bruijn graph tools
Nesoni also includes some highly experimental tools for working with sets of k-mers and De Bruijn graphs. You can:
- Produce a 2D layout of a De Bruijn graph, and interact with it: zoom in to examine details, overlay read-pair data, overlay sequences on top of it (for example, to examine the behaviour of a de-novo assembler such as Velvet).
- Clip a set of reads to remove low-frequency or high-frequency k-mers, or k-mers where there is a more frequent k-mer differing by one SNP.
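As a rough illustration of the concepts behind these tools, the sketch below counts k-mers in a read set, clips a read at its first low-frequency k-mer, and lists De Bruijn graph edges. It is not how "nesoni bag:" or "nesoni graph:" are implemented: all function names and the `min_count` threshold are assumptions, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def clip_read(read, counts, k, min_count=2):
    """Keep the longest prefix whose k-mers all occur at least min_count times."""
    kept = 0
    for i, kmer in enumerate(kmers(read, k)):
        if counts[kmer] < min_count:
            break
        kept = i + k
    return read[:kept]


def de_bruijn_edges(counts, min_count=1):
    """Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    return sorted(
        (kmer[:-1], kmer[1:]) for kmer, n in counts.items() if n >= min_count
    )
```

With the reads `ACGTAC`, `ACGTAC` and `ACGTTT` and k = 3, the k-mers `GTT` and `TTT` occur only once, so the third read is clipped to `ACGT` while the other two are kept whole.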
This poster, presented at BA2009, gives an overview of Nesoni's capabilities.
Nesoni provides the following specific usage information when run with no parameters:
    nesoni 0.104 - high-throughput sequencing data analysis toolset

    Usage:

        nesoni <tool>: ...

    Give <tool>: without further arguments for help on using that tool.

    Alignment to reference -- core tools:

        make-reference:
            - Set up a directory containing a reference sequence,
              annotations, and files for SHRiMP and/or Bowtie.
        shrimp:
            - Run SHRiMP 2 on a read set to set up a working directory.
        bowtie:
            - Run Bowtie 2 on a read set to set up a working directory.
        consensus
            - Filter read hits, and try to call a consensus for each
              position in reference.

        (import:
            - Pipe SAM alignments to set up a working directory)
        (filter:
            - Filter read hits, but do not call consensus)
        (reconsensus:
            - Re-call consensus, using previously filtered hits)

    Alignment to reference -- VCF based tools: (under development)

        These provide an alternative to consensus calling using
        "nesoni consensus:"
        - better handling of complicated Multi-Nucleotide Polymorphisms
        - can't distinguish between absence of a variant and
          insufficient data
          (but can distinguish absence of a variant from insufficient
          data in a single sample if variant present in other samples)

        freebayes:
            - Run FreeBayes to produce a VCF file.
        vcf-filter:
            - Filter a VCF file, eg as produced by "nesoni freebayes:".
        snpeff:
            - Run snpEff to annotate variants with their effects.
        vcf-nway:
            - Summarize a VCF file in a variety of possible ways.
        vcf-patch:
            - Patch in variants to produce genome of samples.
              (similar to consensus_masked.fa produced by
              "nesoni consensus:")
        test-variant-call:
            - Generate synthetic reads, see what variant is called.
        power-variant-call:
            - Apply "neosni test-variant-call:" to a variety of different
              variants over a range of depths.

    Alignment to reference -- analysis tools:

        igv-plots:
            - Generate plots for IGV.
        nway:
            - Compare results of two or more runs of nesoni consensus,
              amongst themselves and optionally with the reference.
              Can produce output suitable for phylogenetic analysis
              in SplitsTree4.
        fisher:
            - Compare results of two runs of nesoni consensus using
              Fisher's Exact Test for each site in the reference.
        normalize:
            - Create normalized Artemis depth plots.
              See also "igv-plots:".
        core:
            - Infer core genome present in a set of strains.

        (consequences:
            - Determine effects at the amino acid level of SNPs and
              INDELs called by nesoni consensus. Most of the features
              of this tool are now a part of "samconsensus:".)

    Alignment to reference -- differential expression:

        count:
            - Count number of alignments to genes, using output from
              "shrimp:".
        test-counts:
            - Use edgeR or limma from BioConductor to detect
              differentially expressed genes, using output from samcount.
        test-power:
            - Test the statistical power of "nesoni test-counts:" with
              simulated data.
        plot-counts:
            - Plot counts against each other.
        norm-from-counts:
            - Calculate normalizing multipliers from counts.
        heatmap:
            - Draw a heat map of counts.
        nmf:
            - Perform a Non-negative Matrix Factorization of counts.
              NMF is a type of fuzzy clustering.
        compare-tests:
            - Compare the output from two runs of "test-counts:"
              eg to compare the results of different "--mode"s

        An R+ module is included with nesoni which will help load
        the output from samcount, for analysis with BioConductor
        packages.

    Peak calling and annotation manipulation tools:

        islands:
        transcripts:
        modes:
            - Various peak and transcript calling algorithms.
        modify-features:
            - Shift start or end position of features, filter by type,
              change type.
        collapse-features:
            - Merge overlapping features.
        relate-features:
            - Find features from one set that are near or overlapping
              features from another set.
        as-gff:
            - Output an annotation in GFF format, optionally filtering
              by annotation type.

    k-mer tools: (experimental)

        bag:
            - Create an index of kmers in a read set for analysis with
              nesoni graph.
        graph:
            - Use a bag or bags to lay out a deBruijn graph.
              Interact with the graph in various ways.

    Utility tools:

        clip:
            - Remove Illumina adaptor sequences and low quality bases
              from reads.
        shred:
            - Break a sequence into small overlapping pieces.
              In case you want to run an existing sequence through
              the above tools. Yes, this isn't ideal.
        as-fasta:
            - Output a sequence file in FASTA format.
        as-userplots:
            - Convert a .igv file to a set of .userplot files for
              viewing in Artemis.
        make-genome:
            - Make an IGV .genome file.
        run-igv:
            - Run IGV with a specified .genome file.
        sample:
            - Randomly sample from a sequence file.
        stats:
            - Show some statistics about a sequence or annotation file.
        fill-scaffolds:
            - Guess what might be in the gaps in a 454 scaffold.
        pastiche:
            - Use MUMMER to plaster a set of contigs over reference
              sequences.
        changes:
            - Prints out change log file.

    Pipeline tools:

        analyse-sample:
            - Clip, align, and call consensus on a set of reads.
        analyse-variants:
            - Produce a VCF file listing SNPs and other variants in a
              set of samples.
        analyse-expression:
            - Count alignments of fragments to genes, then perform
              various types of statistics and visualization on this.
        analyse-samples:
            - Run "analyse-sample:" on a set of different samples,
              then run "analyse-variants:" and/or "analyse-expression".

    If a pipeline tool is run again, it restarts only from the point
    affected by the changed parameters. The following global flags
    control pipeline tool behaviour:

        --make-cores 64
            # Approximate number of cores to use.
        --make-do ''
            # Force this selection of tool names to be recomputed.
            # Examples: --make-do all --make-do analyse-samples/analyse-sample
        --make-done ''
            # Mark this selection of tool names as done without recomputing
            # them, if they would be recomputed.
        --make-show no
            # Show the first actions that would be made (other than those
            # specified by "--make-do"), then abort.
        --make-address 127.0.1.1
            # IP address of the network interface you want the job manager to
            # listen to.
        --make-job '__command__ &'
            # Command to launch a new python. Should either contain
            # __command__, which will be subtituted with the full shell
            # command, including the job name, or __token__ and __jobname__,
            # which should be used in something like "python -m nesoni.legion
            # __token__ __jobname__".
        --make-kill 'pkill -f __jobname__'
            # Command to kill all processes identified by __jobname__.

    Input files:

        - sequence files can be in FASTA, FASTQ, or GENBANK format.
        - annotation files can be in GENBANK or GFF format
          (GFF is not yet supported by all tools).
        - nesoni is able to read files compressed with gzip or bzip2.

    Selections and sorts:

        Working directories can be given a set of tags using "tag:".
        They also implicitly have a tag for the name of the directory,
        and a tag "all".

        A selection expression is a logical expression used to select
        a subset of working directories. It may consist of (grouped by
        precedence):

            tag         - Working directories with tag
            [exp]       - exp
            -exp        - not exp
            exp1:exp2   - exp1 and exp2
            exp1/exp2   - exp1 or exp2
            exp1^expr2  - exp1 xor exp2

        Example:

            [strain1:time1]/[-strain1:-time1]
            - Samples either from strain1 at time1, or not from strain1
              and not from time1. Equivalently: strain1^-time1

        A sort expression is a comma separated list of selection
        expressions, used to sort a list of working directories.

        Example:

            strain1,strain2,time1,time2,time3,replicate1,replicate2
            - Sort, grouping by strain, then by time, then by replicate

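The selection example in the usage text above claims that `[strain1:time1]/[-strain1:-time1]` is equivalent to `strain1^-time1`. The following sketch checks this by modelling each working directory as a set of tags and writing both expressions as plain Python boolean logic. It does not parse Nesoni's selection syntax, and the working directory names and tags are made up for illustration.

```python
def select_a(tags):
    # [strain1:time1]/[-strain1:-time1]
    s, t = "strain1" in tags, "time1" in tags
    return (s and t) or (not s and not t)


def select_b(tags):
    # strain1^-time1 (xor of strain1 with "not time1")
    s, t = "strain1" in tags, "time1" in tags
    return s != (not t)


# Hypothetical working directories and their tags.
workdirs = {
    "s1t1": {"strain1", "time1"},
    "s1t2": {"strain1", "time2"},
    "s2t1": {"strain2", "time1"},
    "s2t2": {"strain2", "time2"},
}

selected = sorted(name for name, tags in workdirs.items() if select_a(tags))
```

Both functions agree on every combination of tags, and `selected` contains the directories that are either strain1 at time1 or neither strain1 nor time1.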
Requirements
==============

Python 2.7 or higher. Use of PyPy where possible is highly recommended for performance.

Python libraries

* Recommended:
  * BioPython
* Optional (used by non-core nesoni tools):
  * matplotlib
  * numpy

External programs:

* Required:
  * SHRiMP or Bowtie2
  * samtools
* Required for VCF based variant calling:
  * Picard
  * Freebayes
* Optional for VCF based variant calling:
  * SplitsTree4

R libraries required by R-based tools (mostly for RNA-seq):

* Required:
  * limma, edgeR from BioConductor
  * seriation
* Optional:
  * goseq from BioConductor
  * NMF

BioPython is used for reading GenBank files. Compiled modules may need to be disabled when installing in PyPy.

There does not seem to be a standard way to install .jar files. Nesoni will search for .jar files in directories listed in environment variables $PATH and $JARPATH.

Installation
==============

The easy way to install or upgrade:

    pip install -I nesoni

Then type "nesoni" and follow the command to install the R module.

See below for more ways to install nesoni.

Advanced Installation
-----------------------

From source, download and untar the source tarball, then:

    python setup.py install

Optional:

    R CMD INSTALL nesoni/nesoni-r

For PyPy it seems to be currently easiest to install nesoni in a virtualenv:

    virtualenv -p pypy my-pypy-env
    my-pypy-env/bin/pip install -I biopython
    my-pypy-env/bin/pip install -I nesoni

You can also set up a CPython virtualenv like this:

    virtualenv my-python-env
    my-python-env/bin/pip install -I numpy
    my-python-env/bin/pip install -I matplotlib
    my-python-env/bin/pip install -I biopython
    my-python-env/bin/pip install -I nesoni

Installing older versions
---------------------------

The interface and API for nesoni may change between versions (I try to keep this to a minimum). In order to run a script or python program that needs an older version, I suggest setting up a virtualenv.

For example, if you want version 0.95:

    virtualenv -p pypy my-old-env
    my-old-env/bin/pip install -I biopython
    my-old-env/bin/pip install -I nesoni==0.95

(Note: I don't have a neat way to make this work with the R components of nesoni.)