Pipeline flow¶

1. Prepare miRNAs for pipeline¶

Prepare miRNA fasta file for MIRZA i.e. for each miRNA sequence check if it is 21 nucleotide long and if not eliminate it. It also replaces all u or U into T. In the same time it splits miRNAs into separate files.

usage: rg_prepare_mirnas_for_mirza_and_split [-h] [-v] --input INPUT
                                             [--output-dir OUTPUT_DIR]

Options:

`-v=False, --verbose=False`
	Be loud!
`--input`	Input miRNA file in fasta format.
`--output-dir=Output`
	Directory for split files, defaults to Output

2. Chunk UTRs¶

Take UTRs and generate fragments by sliding window that can be fed into MIRZA

usage: rg_generate_utr_chunks [-h] [-v] [--input INPUT] --output-dir
                              OUTPUT_DIR [--part-size PART_SIZE]
                              [--window-size WINDOW_SIZE]
                              [--slide-size SLIDE_SIZE]

Options:

`-v=False, --verbose=False`
	Be loud!
`--input=<open file '<stdin>', mode 'r' at 0x7fbdaf2b40c0>`
	Input file in fasta format. Defaults to sys.stdin.
`--output-dir`	Output directory for split files
`--part-size=40000`
	Number of sequences per part, defaults to 40000
`--window-size=50`
	Length of the window for MIRZA, defaults to 50
`--slide-size=20`
	Size of the window slide, defaults to 20

3. Generate miRNA expressions¶

Generate expressions of miRNAs. This file is required by MIRZA and is composed just from the miRNA ID and its expression. Here we set up expression for each miRNA to 1.

There is no special script dedicated to this function. It is just the bash command:

cat input | ruby -ne 'puts "#{$_.rstrip()[1..-1]}\t1" if $_.start_with?(">")' > output

4a. Run MIRZA analysis¶

If the option “scan” as a miRNA target search is chosen putative targets are selected based on MIRZA interaction energy. Thus, this is the part of the pipeline scans the 3’UTRs with MIRZA to find it.

This script generates the jobs of the pipeline that will do the analysis on the split files.

usage: run_mirza_scan [-h] [-v] --config CONFIG --group-id GROUP_ID
                      --input-dir INPUT_DIR --working-dir WORKING_DIR

Options:

`-v=False, --verbose=False`
	Be loud!
`--config`	Config file
`--group-id`	Group Id
`--input-dir`	Input and output directory
`--working-dir`	Working directory of the pipeline. Required because this file is launched from ~

I. Calculate coordinates with MIRZA¶

Here the MIRZA algorithm is used to calculate the energy between miRNA and the 3’UTR fragments generated before. MIRZA is launched from the command:

MIRZA expressions.tab mrnas.fa mirnas.fa 50 noupdate

And the result is piped to the analysis script:

Take MIRZA output and arrange it in proper way

usage: rg_extract_data_from_mirza_output [-h] [-v] [--output OUTPUT] --seqs
                                         SEQS --threshold THRESHOLD
                                         [--context CONTEXT]

Options:

`-v=False, --verbose=False`
	Be loud!
`--output`	Output file in Tab format.
`--seqs`	UTR sequences in fasta format
`--threshold`	Threshold for the score
`--context=50`	Context for sequence to print, defaults to 50

II. Merge and filter results¶

The results are merged with bash command and duplicate results resulting from eg. overlapping 3’UTR fragments or transcripts from the same gene.

Filter duplicates in coordinates by id, miRNA and sequence

usage: rg_filter_duplicates_from_scan [-h] [-v] --coords COORDS --split-by
                                      SPLIT_BY --index-after-split
                                      INDEX_AFTER_SPLIT --output OUTPUT

Options:

`-v=False, --verbose=False`
	Be loud!
`--coords`	Coordinates
`--split-by`	Split id by the string
`--index-after-split`
	After split take this column as new id, 0 based
`--output`	Output name

4b. Run seed scan¶

This part is launched if the “seed” options is chosen when starting the pipeline. The putative miRNA targets are found by simple seed match.

Count miRNA seed in the provided sequences according to chosen definition

usage: rg_count_miRNA_seeds_and_filter_duplicates [-h] [-v] [--motifs MOTIFS]
                                                  [--seqs SEQS]
                                                  [--how {ElMMo,TargetScan,6-mer}]
                                                  [--output OUTPUT]
                                                  [--context CONTEXT]
                                                  --split-by SPLIT_BY
                                                  --index-after-split
                                                  INDEX_AFTER_SPLIT

Options:

`-v=False, --verbose=False`
	Be loud!
`--motifs`	miRNA/siRNA sequences to use when scanning
`--seqs`	Sequences for scaning eg. 3’ UTRs
`--how=TargetScan`
	What definition for seed to use Possible choices: ElMMo, TargetScan, 6-mer
`--output=coords.tab`
	Name of the output file , defaults to coords.tab
`--context=50`	Context for sequence to print, defaults to 50
`--split-by`	Split id by the string
`--index-after-split`
	After split take this column as new id, 0 based

5. Run features analysis¶

This is the main part of the pipeline where all the features are calculated.

I. Calculate MIRZA¶

Calculate MIRZA interaction energy and MIRZA-based Branch Length Score (conservation) for the provided coordinates of putative targets.

usage: rg_calculate_MIRZA [-h] [-v] [--onlyMIRZA {yes,no}] [--seq SEQ]
                          [--out OUT] [--coords COORDS] [--motifs MOTIFS]
                          [--tree TREE] [--mln-dir MLN_DIR] [--threshold THR]
                          [--contextLen CONTEXTLEN] [--mirzabin MIRZABIN]
                          [--reforg REFORG]

Options:

`-v=False, --verbose=False`
	Be loud!
`--onlyMIRZA=no`	Calculate only MIRZA score for given coordinates Possible choices: yes, no
`--seq=seqs.fa`	Fasta with mRNA sequences
`--out=output.tab, -o=output.tab`
	Output table
`--coords=coords.tab`
	File with target sites positions, miRNA and target gene ID
`--motifs=`	Fasta file with miRNA sequences
`--tree=`	Phylogenetic tree of the species used in alignment file
`--mln-dir=`	Directory with multiple alignment files
`--threshold=20.0`
	Threshold for MIRZA score
`--contextLen=50`
	Length of the context sequence serounding binding site
`--mirzabin=`	Path to the MIRZA binary
`--reforg=hg19`	Reference organism to which alignments are performed: default: hg19

II. Calculate accessibility with CONTRAfold¶

A CONTRAfold algorithm is used to calculate accessibility of target site for miRNA.

usage: rg_calculate_contrafold [-h] [-v] [--seq SEQ] [--out OUT]
                               [--coords COORDS] [--contextLen_L CONTEXTLEN_L]
                               [--contextLen_U CONTEXTLEN_U]
                               [--context CONTEXT] [--contrabin CONTRABIN]

Options:

`-v=False, --verbose=False`
	Be loud!
`--seq=seqs.fa`	Fasta file with mRNA sequences , defaults to seqs.fa
`--out=output.tab.gz`
	output file, defaults to output.tab.gz
`--coords=coords.tab`
	file with target sites positions, miRNA and target gene ID
`--contextLen_L=0`
	length of the context sequence downstream binding site to be unwinded
`--contextLen_U=0`
	length of the context sequence upstream binding site to be unwinded
`--context=50`	length of the context of the seed to be checked
`--contrabin=contrafold`
	Path to CONTRAfold binary

III. Calculate flanks composition¶

Calculate flanks composition using provided coordinates.

usage: rg_calculate_flanks_composition [-h] [-v] [--seq SEQ] [--out OUT]
                                       [--coords COORDS]
                                       [--contextLen CONTEXTLEN]

Options:

`-v=False, --verbose=False`
	Be loud!
`--seq=seqs.fa`	fasta with mRNA sequences
`--out=output.tab`
	output table
`--coords=coords.tab`
	file with target sites positions, miRNA and target gene ID
`--contextLen=50`
	length of the context sequence serounding binding site

IV. Calculate distance to the boundary¶

Calculate distance to the boundary based on provided coordinates for targets.

usage: rg_calculate_distance [-h] [-v] [--seq SEQ] [--out OUT]
                             [--coords COORDS] [--contextLen CONTEXTLEN]

Options:

`-v=False, --verbose=False`
	Be loud!
`--seq=seqs.fa`	fasta with mRNA sequences
`--out=output.tab, -o=output.tab`
	output table
`--coords=coords.tab`
	file with target sites positions, miRNA and target gene ID
`--contextLen=50`
	length of the context sequence serounding binding site

6. Merge and add probabilities¶

Merge all results into one features table

usage: rg_merge_results_add_probability_and_calculate_per_gene_score
       [-h] [-v] [--inputs INPUTS] [--coords COORDS] [--model-bls MODEL_BLS]
       [--model-nobls MODEL_NOBLS] [--output OUTPUT] [--only-mirza {yes,no}]
       [--threshold THRESHOLD] [--split-by SPLIT_BY] [--column COLUMN]
       [--name NAME]

Options:

`-v=False, --verbose=False`
	Be loud!
`--inputs`	Coma-separated list of input paths
`--coords`	Cooridinate file used in the beginning
`--model-bls`	Path to model with branch length score
`--model-nobls`	Path to model without branch length score
`--output`	Output file name
`--only-mirza`	Calculate only MIRZA and DON’T calculate MIRZA BLS Possible choices: yes, no
`--threshold=0.12`
	Threshold for summing, defaults to 0.12
`--split-by=NONE`
	If the header of fasta has multiple annotations eg. transcript_id\|entrez_id\|wikiname split it and take only one, defaults to NONE
`--column=0`	0 based column number to take after splitting, defaults to 0
`--name=GeneID`	Name of the id eg. gene, transcript etc , defaults to GeneID

7. Collect results¶

In the end all the results are collected with bash command:

zcat output_dir/\*.score > cwd/mirza_g_results_protocol.tab