BS Seeker performs accurate and fast mapping of bisulfite-treated short reads. It is able to align reads generated from the Cokus et al's library protocol (with tags) or the Lister et al's library protocol (with no tags). Among the software tested, BS Seeker has the highest accuracy, the least biased mapping, and the greatest coverage in Arabidopsis mapping. It is also the only bisulfite aligner feasible for mapping large genomes such as human.


SYSTEM REQUIREMENTS


1. Linux or Mac OS platform

2. Bowtie aligner (Version 0.10.0 or above; Bowtie is freely available at http://bowtie-bio.sourceforge.net )

3. Python (version 2.5.2 or above, it is normally pre-installed in Linux. Type " python " to see the installed version. Python is also freely available at http://www.python.org/download/ )


RUNNING BS SEEKER


Running BS Seeker is a 2-step process, 1) pre-processing the genome and 2) mapping and post-processing. Pre-processing the reference genome is required only once, and the running time highly depends on the size of the genome. The second step converts the genome to a three-letter alphabet and uses Bowtie to align bisulfite-treated reads and to a reference genome (multiple processes are used), then removes non-unique and incorrectly converted mappings.


INPUT FILES


1. BS reads file in Solexa _seq file:

Solexa _seq file format;

   

7       175     685     940     TCCATTATACCGTAACCCAATACAAAAATTATTTAT

7       175     610     214     TCTGTAGACGGGTCGAATGGGGAGTTCATAGGGGGG



or  a simple list of sequences;

   

TCCATTATACCGTAACCCAATACAAAAATTATTTAT

TCTGTAGACGGGTCGAATGGGGAGTTCATAGGGGGG



or  Illumina fastq;

   

@SRR019072.2842 HWI-EAS365_1060:4:1:51:313 length=87

TAATTAGATTTGTGTTATAGATTATTTGTAAAGAAAGTAATTATTAAAGGAAATGTTAGTTTTTATTTGATATATGATAAGAGAACG

+SRR019072.2842 HWI-EAS365_1060:4:1:51:313 length=87

BBBCC@)8ABA/<2>CB:=.:?BBABB1-:@74@B@?=@@ABB@B7@@5/98<;)<>56:?>:;A?A?A@>=AABB@A<3(@@=086



or Solexa qseq file;

   

HWI-EAS399      1       5       1       80      1545    0       1       TCTGTTGGGAGGGAAAGTGGAAGAGTGAGTAGATTAGTTGGATAGGAGTGAGGAGTTGGAGGAAGGGATGGGGTA     BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB     1  

   


or fasta format.

   

>read1

TCCATTATACCGTAACCCAATACAAAAATTATTTAT

>read2

TCTGTAGACGGGTCGAATGGGGAGTTCATAGGGGGG



  1. 2.Reference genome in fasta format

Note that 4-digit serial IDs are uniquely assigned to all reference sequences, to avoid long names. See "log_Preprocessing_genome.txt" or your_reference_file+”.log” generated from preprocessing genome for the assignments. 


USAGE:


Step 1:

Usage: python Preprocessing_genome.py [options] > log_preprocessing.txt

Options [default]:   

    -h,    show this help message and exit

    -f    Input your reference genome filename (fasta format)   

    -t     Reads containing tags? Y/N  [Y]

    -p    Path to Bowtie [~/bowtie-0.10.0/]


Step 2:

Usage: python BS_Seeker.py [options]

Options [default]:

    -h, --help            show this help message and exit   

    -i    Input your read file (support informat: Solexa seq, sequences, illumina fastq, qseq)

    -t    Reads containing tags? Y/N  [Y]

        If reads have tags:

        -f         FW tag [TCTGT]

        -r        RC tag [TACCT]

    -e    The last cycle number of your read to be mapped [36]

    -a    Specify a file containing a single line of the adapter sequence to be clipped from the 5’end of the reads, or, if tags are used, 2 lines of  the 3’-adapter sequences for FW and RC reads [null]

    -p    Path to Bowtie [~/bowtie-0.10.0/]

    -d    Path to Reference genome library [./reference_genome/]

    -m    Number of mismatches (0, 1, 2, 3) [2]

    -o    Output filename [BS_SEEKER_OUTPUT.txt]


OUTPUT FILES


1. LOG file

A log file is generated. It gives the information of the input BS reads, tag information if available, uniquely mapped reads, running time, etc.


2. Output format

The columns in the output is are:


Example:



  1. 1.Read ID (from the header columns in Solexa seq/fastq/qseq/fasta file, or a serial number of the original input)

  2. 2.Number of mismatches between the genomic seq and the BS read list in columns 6 and 7. The bisulfite converted sites between read Ts to genomic Cs are not included.

  3. 3.The strand which the read may be from (+FW, +RC, -RC, -FW)

  4. 4.The coordinate of the mapped position: the first 2 digits indicate the chromosome, the "+" or "-" indicate the mapped strand. The last 10 digits are the 0-based, 5'-end coordinate of the mapped genomic sequence on the Watson strand.

  5. 5.The genomic sequence of the mapped region plus +2 and -2 bps.

  6. 6.BS read sequences from 5' to 3': if the reads are uniquely mapped as they were FW reads, the original reads are shown. If the reads are uniquely mapped as they were RC reads, their reverse complements are shown. 

  7. 7.Summarised sequence of methylated sites: the methylated CG/CHG/CHH sites are marked as X/Y/Z (upper case), whereas the unmethylated CG/CHG/CHH sites are marked as x/y/z (lower case). This column is summarised directly from Columns 6 and 7.

  8. 8.Index=1 if three consecutive methylation non-CG sites appear. Index =0, otherwise.


Convert alignments to BAM format (BSSout2SAM.py)

Convert the alignment to SAM and BAM files.


Usage: BSSout2SAM.py [options]

Note: Python 2.6+, and two python modules, pysam and pyrex, are required.


Options:

  -h    show this help message and exit

  -f     alignment output from BS SEEKER

  -d    Path to Reference genome library (generated in preprocessing genome) [./reference_genome/]

  -r     Input your reference genome file (fasta)


Example:

Generate SAM (text) and BAM (binary) files:

“ python BSSout2SAM.py -d ./reference_genome/ -f myoutput.txt -r chr1.fa “

View BAM file

“ samtools view -h myoutput.txt.bam “


See SAMtools at http://samtools.sourceforge.net for more information.