Instructions

This web interface lets you set the parameters of the BisAMP analysis pipeline in a user-friendly way.
After all parameters are set, they are saved as a file called "expinfo.txt" which is loaded by a bash-script to run the analysis.
In the following sections the parameters and their influence on the pipeline are described.
Afterwards, the parameters are summarized in a table which also lists the default value and the value range of each parameter.
Then, an example "expinfo.txt"-file is given which can be read by the pipeline script.
Finally, artificial files are given to test the pipeline.


1. Upload a zip-file (*.zip) containing de-multiplexed fastq-files from the MiSeq sequencing run
or upload one not de-multiplexed fastq file (*.fastq) containing all sequences.


This pipeline is created to analyze data presented as fastq files. Those are identified by the file suffix "*.fastq" or "*.fq". They can also be in gzipped format ("*.fastq.gz").
If you have multiple samples per experiment, you should make sure that the samples are either distinguishable by sequencing barcodes, or that you have one sequencing file per sample or, in the case of paired-end sequencing, two sequencing files per sample.
In the latter case, all sequencing files should be included in one zip archive ("*.zip"). Please be aware that you can only upload one zip archive or one fastq file per analysis.

If you have a paired-end experiment, the two sequencing files per sample should be distinguishable by having "_R1" and "_R2" right before their file suffix. Other name strings cannot be handled and produce errors.
Please make sure to not exceed the maximum allowed file size of 500 GB.
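
The naming rule for paired-end files can be checked before the upload. The following Python sketch is only an illustration and not part of the pipeline; the function names are hypothetical, and it assumes that only the suffixes named above occur.

```python
import re
from collections import defaultdict

# Suffixes named in the text above (assumption: no other suffixes are used).
SUFFIXES = (".fastq.gz", ".fastq", ".fq")

def split_fastq_name(filename):
    """Return (sample_id, mate); mate is 'R1', 'R2' or None for single-end names."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            break
    else:
        raise ValueError(f"not a fastq file name: {filename}")
    # "_R1" or "_R2" must stand right before the file suffix.
    match = re.fullmatch(r"(.+)_(R[12])", stem)
    if match:
        return match.group(1), match.group(2)
    return stem, None

def group_paired_files(filenames):
    """Map each sample id to its R1/R2 files and check that both are present."""
    samples = defaultdict(dict)
    for name in filenames:
        sample_id, mate = split_fastq_name(name)
        if mate is None:
            raise ValueError(f"missing _R1/_R2 before the suffix: {name}")
        samples[sample_id][mate] = name
    for sample_id, mates in samples.items():
        if set(mates) != {"R1", "R2"}:
            raise ValueError(f"sample {sample_id} lacks one of its paired files")
    return dict(samples)
```

The keys of the returned dictionary (for example "AAAA") are the strings that the metadata file should list as unique identifiers.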

2. Upload a metadata file (*.txt) containing the unconverted reference sequences assigned to each of the sequencing experiments.

As a second prerequisite for the analysis, please upload a metadata text file containing information ordered in columns. Importantly, one column should contain a unique identifier for the sequenced samples or a barcode to discriminate the sequenced samples in a multiplex fastq-file. In case your fastq-files are already de-multiplexed, the unique identifier should be a string, which can be used to discriminate the sequencing files.

Example:

Your PE-sequencing run comes with fastq-files for three samples named as:

AAAA_R1.fastq.gz; AAAA_R2.fastq.gz; BBBB_R1.fastq.gz; BBBB_R2.fastq.gz; CCCC_R1.fastq.gz; CCCC_R2.fastq.gz;

then, the column of the text file should contain

AAAA, BBBB and CCCC.

Another column in this file should contain for each sample a reference sequence in its unconverted state. Please be aware, that the final heatmaps will be produced according to that reference sequence. If you want to omit primer sequences, then remove those from your reference.
As an option, you can also add a column which will describe the x-axis label of the output heatmap. Those can be sample names or others.
Please also make sure that each column contains a header.
There is no mandatory column order, because you have to assign the columns in the parameter section by yourself.

Please find below an example of a structure of a metadata text-file for the analysis of a multiplex fastq-file.

Barcode Sample_Name Reference_Sequence
ATCG untreated_RNA1 ATATATTTTCCCTGTTAGCGTTACGCGTTCCCGAAGT
GGTA treated1_RNA1 ATATATTTTCCCTGTTAGCGTTACGCGTTCCCGAAGT
TAGC treated2_RNA1 ATATATTTTCCCTGTTAGCGTTACGCGTTCCCGAAGT
AATG untreated_RNA2 GGTGTATCGGACCTGACCGTTGACCATGCTAGGTCACACCTAAAGTTT
CCAT treated1_RNA2 GGTGTATCGGACCTGACCGTTGACCATGCTAGGTCACACCTAAAGTTT
GTCT treated2_RNA2 GGTGTATCGGACCTGACCGTTGACCATGCTAGGTCACACCTAAAGTTT


Please find below an example of a structure of a metadata text-file for the analysis of de-multiplexed files.

File_Name Sample_Name Reference_Sequence
AAAA untreated_RNA1 ATATATTTTCCCTGTTAGCGTTACGCGTTCCCGAAGT
BBBB treated1_RNA1 ATATATTTTCCCTGTTAGCGTTACGCGTTCCCGAAGT
CCCC treated2_RNA1 ATATATTTTCCCTGTTAGCGTTACGCGTTCCCGAAGT
DDDD untreated_RNA2 GGTGTATCGGACCTGACCGTTGACCATGCTAGGTCACACCTAAAGTTT
EEEE treated1_RNA2 GGTGTATCGGACCTGACCGTTGACCATGCTAGGTCACACCTAAAGTTT
FFFF treated2_RNA2 GGTGTATCGGACCTGACCGTTGACCATGCTAGGTCACACCTAAAGTTT
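
To check a metadata file before the upload, it can be read with a few lines of Python. This is a minimal sketch which assumes that the columns are separated by tabs; the function name is hypothetical, and it selects the identifier column by its header, whereas the web interface asks you to assign the columns in the parameter section.

```python
import csv
import io

def read_metadata(handle, id_column):
    """Read a tab-separated table with a header row and check the identifiers."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    ids = [row[id_column] for row in rows]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"identifiers are not unique: {duplicates}")
    return rows

# Two rows of the de-multiplexed example above, written with tabs.
example = (
    "File_Name\tSample_Name\tReference_Sequence\n"
    "AAAA\tuntreated_RNA1\tATATATTTTCCCTGTTAGCGTTACGCGTTCCCGAAGT\n"
    "BBBB\ttreated1_RNA1\tATATATTTTCCCTGTTAGCGTTACGCGTTCCCGAAGT\n"
)
rows = read_metadata(io.StringIO(example), id_column="File_Name")
print([row["Sample_Name"] for row in rows])
```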


3. Define parameters for the analysis

After successful data and metadata file upload, you can assign the parameters of the pipeline in the forms below. In all cases the default parameters are given.

Define your sequencing data:

Using the two checkboxes, you can assign if your uploaded data is de-multiplexed (default) or if it still contains barcodes and needs to be de-multiplexed prior to the analysis.

Define column in textfile containing the filenames (excluding the file suffixes and excluding "_R1" and "_R2" in PE experiments):

This form element only appears, if you have selected that your data is de-multiplexed.

During file upload you provided a metadata text file.
In case your data is de-multiplexed, you have to assign here which of the columns contains the filenames excluding the file suffixes as well as the SE/PE-read filename discriminators ("_R1" and "_R2").
Please see also the examples above.
In the "expinfo.txt"-file, this parameter is defined as "seqcol".

Define column in textfile containing the barcodes:

These form elements only appear if you have selected to analyze data containing barcodes.

During file upload you provided a tab-separated text file.
In case you uploaded data which still needs to be de-multiplexed, you have to assign the column containing the barcodes.
In the "expinfo.txt"-file, this parameter is defined as "bc".

Additionally, you can decide, how many random bases upstream and downstream are allowed to form a string to detect the barcode in the read. Using the default parameters the reads are scanned and separated by the following string:

5'-NN-Barcode-NNNNNNNNNN-3'
In the "expinfo.txt"-file, those parameters are defined as "nbRandom5prime" and "nbAfterBarcode".
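
The search string above can be pictured as a regular expression. The following Python sketch is one possible reading and not the pipeline's own code: it assumes that the string is anchored at the start of the read and that exactly the given numbers of random bases are required. The function names are hypothetical.

```python
import re

def barcode_pattern(barcode, nb_random_5prime=2, nb_after_barcode=10):
    """Compile 5'-N...N-Barcode-N...N-3' as a regex anchored at the read start."""
    return re.compile(
        "^[ACGTN]{%d}%s[ACGTN]{%d}"
        % (nb_random_5prime, re.escape(barcode), nb_after_barcode)
    )

def assign_read(read, barcodes, **kwargs):
    """Return the first barcode whose pattern matches the read, or None."""
    for barcode in barcodes:
        if barcode_pattern(barcode, **kwargs).match(read):
            return barcode
    return None

# Two random bases, the barcode ATCG, ten further bases, then the rest of the read.
read = "GT" + "ATCG" + "ACGTACGTAC" + "TTTT"
print(assign_read(read, ["GGTA", "ATCG"]))
```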

Furthermore you can choose, if duplicates in your data should be removed.
In the "expinfo.txt"-file, this parameter is defined as "rmDupl".

Define column in textfile containing the unconverted reference sequences:

During file upload you provided a metadata text file.
Please define here, which of the columns contains the reference sequence. Make sure, that the reference sequence is not bisulfite converted. Also be aware, that the final heatmap will depend on the reference sequence. If primer sequences should not be displayed, chop them from the reference.
In the "expinfo.txt"-file, this parameter is defined as "refcol".

Define column in textfile containing the sample name (will also be labeled on the x-axis):

During file upload you provided a metadata text file.
Please define here which of the columns contains the information which should be plotted on the x-axis of the heatmap.
In the "expinfo.txt"-file, this parameter is defined as "label".

Experiment type:

Please define here, if you have SE or PE-reads to analyze. In case you choose SE, only one fastq-file per sample is used for the analysis. In case you choose PE, please be sure to have two files uploaded per sample which are marked with "_R1" and "_R2" to mark the single and the paired reads.
Paired-end analysis needs approximately double the time of a single-end analysis. However, the information of the paired reads is used to complement the output heatmap and to perform quality control as well as correction of mismatched bases.
Analysis of PE-reads is not supported on barcoded files.
In the "expinfo.txt"-file, this parameter is defined as "exp".

Define which reads should be visualized:

This form element only appears, if you have selected to analyze your data as PE experiment.

During the PE analysis the single reads as well as the paired reads are subjected individually to bsmap for mapping them to the reference. Reads which pass the mapping are subjected to filtering of artifacts.
Here you can decide if the filtering and the heatmaps will be done using reads which pass in the single OR in the paired read or using only reads which pass in single AND paired read.
In the "expinfo.txt"-file, this parameter is defined as "PEQC".
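
The OR and AND options correspond to the union and the intersection of the read sets which pass in the two files. A minimal Python sketch, assuming that reads are identified by their names and using the PEQC coding from the summary table (1: all reads, 0: reads which survived in both files); the function name is hypothetical.

```python
def reads_to_visualize(passed_r1, passed_r2, peqc=1):
    """Combine the read names which passed mapping in the R1 and the R2 file."""
    r1, r2 = set(passed_r1), set(passed_r2)
    if peqc == 1:
        return r1 | r2  # OR: the read passed in at least one of the files
    return r1 & r2      # AND: the read passed in both files

print(sorted(reads_to_visualize({"read1", "read2"}, {"read2", "read3"}, peqc=0)))
```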

Define cutoff for detection of unconverted reads:

With this parameter you can define the maximal number of adjacent Cs which is allowed in a read. Reads exceeding this number are considered unconverted and are discarded.
This is of importance in RNA, where secondary structures with C-stretches might interfere with the bisulfite conversion reaction.
In the "expinfo.txt"-file, this parameter is defined as "Ccut".
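
The idea of this cutoff can be sketched in Python as follows. This is only an illustration under two assumptions which are not confirmed by the pipeline: "adjacent Cs" is read as directly neighbouring C characters in the read, and a read is discarded when its longest C-stretch is larger than the cutoff. The function names are hypothetical.

```python
def longest_c_run(read):
    """Length of the longest stretch of directly neighbouring Cs in a read."""
    longest = current = 0
    for base in read:
        current = current + 1 if base == "C" else 0
        longest = max(longest, current)
    return longest

def is_unconverted(read, ccut=3):
    """Assumption: a read counts as unconverted when its C-stretch exceeds ccut."""
    return longest_c_run(read) > ccut

print(is_unconverted("TTCCCT"), is_unconverted("TTCCCCT"))
```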

Adapting mapping parameters in program bsmap:

To map the reads to the reference, the program bsmap is used. Most of the parameters are set internally to achieve best results. However, you can adapt some parameters to make the mapping more flexible.
Those include:

-v:

This parameter describes the mismatch rate which is considered. A mismatch rate larger than the indicated number leads to the removal of the read from the analysis.
A value between 0 and 1 is interpreted as the fraction of mismatches with respect to the read length, e.g. 0.03 means up to 3 mismatches per 100 bases or up to 6 mismatches per 200 bases.
A value equal to or larger than 1 is interpreted as an integer number of mismatches, e.g. 3 means up to 3 mismatches are allowed irrespective of the sequence length.

-w:

This parameter describes the maximal number of equal best hits. Its default is set to 1000.

-s:

With this parameter you can change the core number of nucleotides used to detect an alignment from its default value 12.

In the "expinfo.txt"-file, those parameters are defined as "map_v", "map_w" and "map_s".
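
The two interpretations of the -v value can be made concrete with a short calculation. The following Python sketch follows the description above; rounding the fractional case down to whole mismatches is an assumption, and the function name is hypothetical.

```python
import math

def max_mismatches(map_v, read_length):
    """Number of mismatches tolerated for a read of the given length."""
    if map_v < 0:
        raise ValueError("map_v must not be negative")
    if map_v < 1:
        # Fraction of the read length; the small epsilon guards against
        # floating-point results such as 2.9999999 (rounding down is assumed).
        return math.floor(map_v * read_length + 1e-9)
    # Values of 1 and above are an absolute number of mismatches.
    return int(map_v)

print(max_mismatches(0.03, 100), max_mismatches(0.03, 200), max_mismatches(3, 50))
```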

Adapting parameters for filtering of artifacts:

After mapping the reads, the result is subjected to an artifact-filtering procedure. If you choose the option to change the default values, you can adapt the parameters of this step to your needs.

You can change the following parameters:

minLen:

This parameter describes the minimal sequence length. Reads below this length are discarded.

minQ:

This parameter describes the minimal sequence quality. Reads having a quality below the given value are discarded.

alpha:

With this parameter you can define the off-target amplification cutoff alpha, which is the accepted Type I error of the off-target detection filter.
If a mismatch co-occurs with an unconverted C, and this correlation has a p-value below alpha, the reads containing this mismatch are discarded.

In the "expinfo.txt"-file, those parameters are defined as "Minlen", "MinQ" and "alpha".

A to I editing or polymorphism:

In some RNAs, A to I editing or other polymorphisms can occur at distinct positions. In the analysis, those positions create mismatches in the mapping procedure, which might lead to the read being discarded. To avoid that, you can define, for a certain set of samples, a base position in the reference where mismatches should be ignored.

A-I_pos:

Here you can add the position within the reference sequence, where an A to I editing or other polymorphisms can occur.
The default value is set to 0, meaning that no position should be considered. Please count the position starting with 1 being the first base in the reference.

A-I_name:

Because A to I editing or polymorphisms are not common for all RNA types, you can discriminate the samples to consider by giving a string which is present in the sample names.
Please be aware, that the string is scanned in the metadata column which you assigned to be displayed on the x-axis of the output heatmap. (see above)
Taking the example above, when you choose the column "Sample_Name" to be displayed in the heatmap, then a suitable string would be "RNA1" to only consider the first three samples for A to I editing. Using the string "RNA" would lead to all samples being considered for A to I editing.
In the "expinfo.txt"-file, those parameters are defined as "AIpos" and "AIname".
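
The selection of samples by "AIname" and the counting of "AIpos" can be sketched in Python. This is an illustration only: it assumes a case-sensitive substring search, and the function names are hypothetical. Note that the value "NA" has to be treated as "not applied" before searching, because "NA" itself is a substring of names such as "untreated_RNA1".

```python
def samples_for_ai(sample_names, ai_name):
    """Return the sample names which contain the string ai_name."""
    if ai_name in ("", "NA"):  # not applied
        return []
    return [name for name in sample_names if ai_name in name]

def ai_index(ai_pos):
    """Convert the 1-based AIpos into a 0-based index; 0 means not applied."""
    return None if ai_pos == 0 else ai_pos - 1

names = [
    "untreated_RNA1", "treated1_RNA1", "treated2_RNA1",
    "untreated_RNA2", "treated1_RNA2", "treated2_RNA2",
]
print(samples_for_ai(names, "RNA1"))
```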

Adapting parameters for output heatmaps:

In the final part of the web interface, you can choose what the output heatmaps should look like.

Define what to plot:

Here you can choose if you want to plot the unconversion status of all Cs in the output heatmap or only the status of Cs in CpG context.
Although this pipeline is created to analyse RNA sequences, we give this option to visualize results coming from DNA amplicons, too.
In the "expinfo.txt"-file, this parameter is defined as "Cplot".

Define Colors for the Heatmap:

In each heatmap four colors represent four different states of a C after the analysis. In the default settings, those are:

yellow: Those Cs were converted during the bisulfite reaction.
blue: Those Cs remain Cs after bisulfite conversion. They are unconverted in the read.
red: Those Cs show a mismatch, meaning that another base than C or T was detected in the read at the respective position.
white: For those Cs there is no sequencing information present in the read.

Using the selection-fields you can change the colors for all states to create individual heatmaps. You can choose several standard colors, but you can also add any color in hexadecimal code. In that case please provide a 6-digit string without the preceding #.
In the "expinfo.txt"-file, this parameter is defined as "colall".
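
The structure of the "colall" value can be illustrated with a small Python parser. The order of the four states and the list of named colors are taken from the summary table; the function name, the state labels and the conversion of hexadecimal codes into "#RRGGBB" strings are assumptions made for this sketch.

```python
import re

# Order of the four fields: col1_col2_col3_col4 (labels are hypothetical).
STATES = ("converted", "unconverted", "mismatch", "no_information")
NAMED_COLORS = {"yellow", "blue", "red", "green", "white", "black"}

def parse_colall(colall):
    """Split a colall string into one color per C state."""
    parts = colall.split("_")
    if len(parts) != len(STATES):
        raise ValueError("colall needs exactly four colors separated by '_'")
    colors = {}
    for state, part in zip(STATES, parts):
        if part in NAMED_COLORS:
            colors[state] = part
        elif re.fullmatch(r"[0-9A-Fa-f]{6}", part):
            colors[state] = "#" + part  # hex code is given without the '#'
        else:
            raise ValueError(f"unknown color: {part}")
    return colors

print(parse_colall("yellow_blue_FF8800_white"))
```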

Small or large figure font sizes?:

Here you can choose, if you want to have small or large fonts within the heatmap.
In the "expinfo.txt"-file, this parameter is defined as "size".

Should the legend be shown?:

Here you can decide if you want the legend to be drawn for each heatmap.
In the "expinfo.txt"-file, this parameter is defined as "legend".

How should the reads be sorted?

Please choose here, how you would like to sort the reads.
You have the options to decide on:

by unconversion rate:
The reads are sorted by the number of unconverted Cs per read. The algorithm does not take into consideration the position of the unconverted Cs.

by hierarchical clustering:
The reads are hierarchically clustered.

no sorting:
The reads are not sorted at all.

In the "expinfo.txt"-file, this parameter is defined as "readsort".
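
Sorting by unconversion rate can be pictured with the following Python sketch. The encoding of a read as a string of states ("U" for an unconverted C, "C" for a converted C) and the descending order are assumptions made for illustration; the function name is hypothetical.

```python
def sort_by_unconversion(reads):
    """Sort reads by their number of unconverted Cs, highest first.

    Only the count is used, not the positions of the unconverted Cs.
    """
    return sorted(reads, key=lambda read: read.count("U"), reverse=True)

print(sort_by_unconversion(["CCU", "UUU", "CCC", "UCU"]))
```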

4. Run the analysis.

When you press the button "Run the Analysis", the web interface collects the values of all parameters and saves them into the file "expinfo.txt".
Afterwards, the pipeline is started by a bash script which reads this file and uses the parameter values to run the pipeline.
You can check the pipeline status by clicking the button "Check pipeline status".
The samples are processed one after another. As soon as the analysis of one sample is done, its heatmap is displayed in the web interface. (Automatic heatmap update is currently not supported by Internet Explorer. If you use Internet Explorer, simply refresh the page during the run to update the heatmaps.)
You can also examine the results during the run by clicking on the small heatmaps.
After the complete analysis is done, two buttons appear.
By pressing the button "Download Results" the pipeline results are collected in a zip archive which can be downloaded.
Pressing the button "Discard results for Re-analysis" discards every output and sets the web-service back to the state prior to the analysis. Now the analysis can be re-run using other parameters.
Finally, the button "Start from the Beginning", which is present during the complete run, can be used to start a new pipeline run.



Summarized table to show all parameters which are necessary to run the pipeline:

Parameter | Function | Possible Values | Default Value
token | A specific string to identify the analysis. This is set by the program automatically. If you run the pipeline offline, set this value to "offline". In that case, please refer to the README.txt-file within the pipeline folder. | -string- | ----
exp | Define if the analysis should be performed in single-end mode (SE) or paired-end mode (PE). | SE or PE | SE
bc | Define if the sequencing files are de-multiplexed or, if they contain barcodes, which column of the auxiliary text file holds the barcodes. | 0 (de-multiplexed); 1-n (barcode in which column) | 0
nbRandom5prime | Define how many random 5' bases are allowed in barcode detection. | -integer- | 2
nbAfterBarcode | Define how many random 3' bases are allowed in barcode detection. | -integer- | 10
rmDupl | Define if duplicates should be removed. | 0 (no); 1 (yes) | 0
PEQC | Define which reads to visualize in the heatmap in PE experiments after filtering. | 1 (all reads); 0 (reads which survived in both files) | 1
map_v | bsmap parameter -v: mismatch rate | -double- | 0.03
map_w | bsmap parameter -w: max. number of equal best hits | -integer- | 1000
map_s | bsmap parameter -s: seed length | -integer- | 12
MinLen | Final filtering parameter: minimal sequence length (bp) | -integer- | 25
MinQ | Final filtering parameter: minimal sequence quality | -integer- | 10
Ccut | Final filtering parameter: cutoff for detection of unconverted reads (experimental artifact) | -integer- | 3
alpha | Off-target amplification cutoff value (Type I error) | -double- | 0.05
AIpos | A to I transition: position in reference sequence | -integer-; 0 (not applied) | 0
AIname | A to I transition: string in sample name to identify affected samples | -string-; NA (not applied) | NA
Cplot | Heatmap parameter: Define if all Cs or only CpGs should be visualized. | C (all Cs); CG (CpGs) | C
colall | Heatmap parameter: Define colors for heatmap production (col1_col2_col3_col4). | col1_col2_col3_col4 with col1: converted Cs; col2: unconverted Cs; col3: mismatch; col4: no match. Each color is either one of (yellow, blue, red, green, white, black) or a 6-digit hex code. | yellow_blue_red_white
readsort | Heatmap parameter: Define how reads should be sorted. | n (no sorting); m (sorting by unconversion rate); h (sorting by hierarchical clustering) | m
label | Define column describing the sample name in auxiliary text file (also x-axis label in heatmap). | -integer- | ---
seqcol | Define column describing the sequencing id in auxiliary text file. | -integer- | ---
refcol | Define column describing the reference sequence in auxiliary text file. | -integer- | ---
legend | Heatmap parameter: Define if the legend should be shown. | y (legend should be shown); n (omit legend) | y
size | Heatmap parameter: Define the font size in the heatmap. | s (small font sizes); l (large font sizes) | s



Example structure of an "expinfo.txt" file


token: offline
exp: SE
bc: 0
nbRandom5prime: 2
nbAfterBarcode: 10
rmDupl: 0
PEQC: 1
map_v: 0.03
map_w: 1000
map_s: 12
Minlen: 25
MinQ: 10
Ccut: 3
alpha: 0.05
AIpos: 0
AIname:
Cplot: C
colall: white_yellow_blue_red
readsort: n
label: 1
seqcol: 2
refcol: 3
legend: y
size: s
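
The "name: value" layout of this file is simple enough to be read with a few lines of code. The pipeline itself loads the file with a bash script; the following Python function is only a hypothetical sketch for inspecting such a file, and it keeps all values as strings.

```python
def parse_expinfo(text):
    """Turn the lines of an expinfo.txt file into a dictionary of strings."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"line without ':' found: {line}")
        params[key.strip()] = value.strip()
    return params

example = "token: offline\nexp: SE\nmap_v: 0.03\nAIname:\ncolall: white_yellow_blue_red\n"
params = parse_expinfo(example)
print(params["exp"], float(params["map_v"]))
```

An empty entry such as "AIname:" results in an empty string.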



Example data of tRNA_Asp in ES and ND cells as well as 28S rRNA of human Fibroblasts.

PE sequencing files (two Fastq-files in a zip-archive) and a corresponding metadata file
data.zip
metadata.txt

Additional artificial files to test the analysis pipeline.

PE sequencing files (two Fastq-files in a zip-archive) and a corresponding metadata file
PE_example.zip
PE_metadata.txt


Barcoded sequencing file and a corresponding metadata file
BC_example.fastq.gz
BC_metadata.txt