RBPmotif web server

Help topics




General remarks

RNA-binding proteins (RBPs) are key players in post-transcriptional regulation; however, binding preferences have been characterized for only a small number of these proteins. Recent experimental studies have enabled the identification of RNA targets associated with specific RBPs. However, because the locations of the binding sites within the targets are unknown, and because RBPs recognize both sequence and structure elements in their binding sites, identifying RBP binding preferences from these data remains challenging.
RNAcontext [1] was proposed to address this challenge: it uses a novel representation of secondary structure to infer both the sequence and structure preferences of RBPs. The RBPmotif web server implements the RNAcontext algorithm together with an additional analysis framework, so that the binding preferences of an RBP of interest can be identified from experimental data.

The RBPmotif web server runs in two modes:
  • de novo motif finding with RNAcontext when there is no a priori knowledge about RBP binding sites
  • analysis of a previously identified motif for enrichment in a particular secondary structure context

Instructions

The following steps are common to both modes:
  • Enter your email address (optional): RNAcontext can take a long time to finish when the number of input sequences is large. If you provide an email address, we will send you an email with a link to your results.
  • Enter input sequences: You can either paste the sequences (in FASTA format) into the text boxes or upload FASTA files. Note that RNAcontext needs to see many examples (both bound and unbound) in order to infer binding preferences accurately.
  • Choose a representation of secondary structure: The allowed options are PU, PLE, PHTE and PHIME. Please see the related section (Representation of secondary structure) for a description of these options.
  • Choose between global and local folding for secondary structure prediction: Global folding considers the entire sequence at once. Local folding averages the results from folding the sequence in windows. If local folding is selected, you also have to choose the window length (RNAplfold parameter -W) and the maximum base pair span (RNAplfold parameter -L).
For de novo motif finding:
  • Choose a range for the motif length (between 4 and 12).
For structure analysis:
  • Enter the motif in IUPAC format. The length of the motif can be between 4 and 12.

Limits

The input sequences should conform to the following limitations:
  • The number of sequences: To make sure the predictions are statistically reliable, the bound set and the unbound set must each contain at least 50 sequences. Each set can contain at most 1000 sequences, for either type of analysis.
  • The length of sequences: Each sequence must be longer than the maximum motif length and can be at most 1500 nts long.
If your input data is larger than the limits of the RBPmotif web server, you can download the standalone version of RNAcontext from here.


Secondary structure prediction

We predict the secondary structures of the input sequences using an existing algorithm called RNAplfold [2]. RNAplfold considers the ensemble of all possible structures of an RNA sequence to calculate the probability of each base being in various structural contexts (e.g. hairpin loop, external loop). The standard output of RNAplfold is the accessibility, which can be interpreted as the probability that the region of interest is single-stranded. We modified RNAplfold so that it instead outputs the probabilities of the region of interest being in each of the four possible single-stranded contexts (i.e., hairpin loop, internal loop (also known as bulge), multiloop, or external loop); these four probabilities sum to the original accessibility. Using the RNAplfold output, we then compute a matrix for each sequence, which we call the secondary structure profile. Its rows represent the paired context and the four single-stranded RNA (ssRNA) structural contexts (i.e., hairpin loop, internal loop, multiloop, external loop), and its columns correspond to the positions of the sequence. Each entry is the probability that a base appears in a particular structural context, and the probabilities within a column sum to 1.
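The relationship between the four single-stranded probabilities, the accessibility and the paired probability can be illustrated with a minimal Python sketch. The function names and the example numbers are hypothetical and are not part of the server or of RNAplfold; the sketch only assumes that the per-position probabilities for hairpin loop, internal loop, multiloop and external loop are already available.

```python
# Row labels of the secondary structure profile: paired plus four ssRNA contexts.
CONTEXTS = ("P", "H", "I", "M", "E")


def make_profile(unpaired_probs):
    """Build a profile from per-position (H, I, M, E) probabilities.

    The paired probability is taken as 1 minus the accessibility, where the
    accessibility is the sum of the four single-stranded probabilities.
    """
    profile = {context: [] for context in CONTEXTS}
    for h, i, m, e in unpaired_probs:
        accessibility = h + i + m + e
        if accessibility < 0.0 or accessibility > 1.0 + 1e-9:
            raise ValueError("single-stranded probabilities must sum to at most 1")
        profile["P"].append(1.0 - accessibility)
        profile["H"].append(h)
        profile["I"].append(i)
        profile["M"].append(m)
        profile["E"].append(e)
    return profile


def column_sums(profile):
    """Sum the five context probabilities at every sequence position."""
    length = len(profile["P"])
    return [sum(profile[c][j] for c in CONTEXTS) for j in range(length)]


# Hypothetical probabilities for a sequence of length 3.
example = make_profile([(0.5, 0.1, 0.1, 0.1), (0.0, 0.0, 0.0, 0.2), (0.0, 0.0, 0.0, 0.0)])
print(column_sums(example))
```

Every column of the resulting matrix sums to 1 by construction, which is the property described above.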

Representation of secondary structure
Users can choose between four levels of secondary structure representation:
  • PU considers two contexts: paired (P) and unpaired (U).
  • PLE considers three contexts: paired (P), the union of hairpin loop, internal loop and multiloop (L) and external loop (E).
  • PHTE considers four contexts: paired (P), hairpin loop (H), the union of internal loops and multiloops (T) and external loop (E).
  • PHIME considers five different structural contexts: paired (P), hairpin loop (H), internal loop (I), multiloop (M), external loop (E).
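The coarser representations can be viewed as sums over groups of the five PHIME contexts. The following Python sketch illustrates this under the assumption that a union of contexts corresponds to adding their probabilities; the names GROUPINGS and collapse are hypothetical and do not come from RNAcontext.

```python
# Which PHIME contexts are merged into each label of the coarser alphabets.
GROUPINGS = {
    "PU": {"P": "P", "U": "HIME"},
    "PLE": {"P": "P", "L": "HIM", "E": "E"},
    "PHTE": {"P": "P", "H": "H", "T": "IM", "E": "E"},
    "PHIME": {"P": "P", "H": "H", "I": "I", "M": "M", "E": "E"},
}


def collapse(column, alphabet):
    """Collapse one PHIME profile column into the chosen representation."""
    if alphabet not in GROUPINGS:
        raise ValueError("unknown representation: " + alphabet)
    return {
        label: sum(column[context] for context in members)
        for label, members in GROUPINGS[alphabet].items()
    }


# Hypothetical probabilities for a single sequence position.
position = {"P": 0.4, "H": 0.3, "I": 0.1, "M": 0.1, "E": 0.1}
for name in ("PU", "PLE", "PHTE", "PHIME"):
    print(name, collapse(position, name))
```

Because each grouping partitions the same five contexts, the collapsed values still sum to 1 at every position.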

Global or local folding?
Users can choose between global or local folding when predicting secondary structure. Global folding considers the entire sequence at once, whereas local folding divides the sequence into overlapping short windows and aggregates the results from folding each window separately. Based on our previous experiments and a separate study [3], we recommend choosing the local folding option (especially for long RNA sequences). If local folding is chosen, users have to specify two additional parameters that are described below.

RNAplfold parameters
We run RNAplfold with the option -u 1 (length of the subsequence for which the probabilities are calculated). This option is fixed because RNAcontext's secondary structure model requires single-nucleotide profiles. The other parameters that need to be specified are the window length (-W) and the maximum base pair span (-L). If the global folding option is chosen, we automatically set -W and -L to the length of the sequence. However, to avoid long running times, we switch to local folding (with default parameters -W=200 and -L=150) if the sequence is longer than 1000 nts. If local folding is chosen, users are required to specify the -W and -L arguments.
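The parameter rules above can be summarized in a short Python sketch. The function name rnaplfold_options and its arguments are hypothetical; the sketch only reproduces the decision logic described in this section and does not show how the server actually invokes its modified RNAplfold.

```python
def rnaplfold_options(seq_len, mode, window=None, span=None):
    """Return the -W, -L and -u options for one sequence as a list of strings."""
    if mode == "global":
        if seq_len > 1000:
            # Long sequences fall back to local folding with the default values.
            w, l = 200, 150
        else:
            # Global folding: window and span cover the whole sequence.
            w, l = seq_len, seq_len
    elif mode == "local":
        if window is None or span is None:
            raise ValueError("local folding requires both -W and -L")
        w, l = window, span
    else:
        raise ValueError("mode must be 'global' or 'local'")
    # -u 1 is fixed because single-nucleotide profiles are required.
    return ["-W", str(w), "-L", str(l), "-u", "1"]


print(rnaplfold_options(300, "global"))
print(rnaplfold_options(1200, "global"))
print(rnaplfold_options(1200, "local", window=80, span=40))
```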

Search against known binding preferences

Once we predict the binding preferences with RNAcontext (de novo motif finding), we compare the predicted sequence motif against known RBP motifs to identify RBPs with similar sequence preferences. To calculate motif similarities, we use an existing program called TomTom [5] with the arguments -dist Pearson (Pearson correlation coefficient as the distance metric) and -align SWU (Smith-Waterman, ungapped; alignments are extended). We compiled the known motifs from the RBPDB database [6] and the RNAcompete compendium [7]. The 5 most similar motifs are displayed in a table that also shows the corresponding RBP, gene, p-value and q-value (output by TomTom), and a link to the original database entry.

IUPAC motifs

You can find more information about IUPAC code here. For example, GGYMA is a motif where Y stands for C or U, and M stands for A or C. This motif matches the following sequences: GGCAA, GGCCA, GGUAA, GGUCA.
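One way to work with IUPAC motifs is to translate them into regular expressions. The Python sketch below is only an illustration and is not the server's implementation; the helper names are hypothetical, and treating T as U in the scanned sequence is an assumption made here for convenience.

```python
import itertools
import re

# Standard IUPAC nucleotide codes, written for RNA (U instead of T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def expand_motif(motif):
    """List every concrete sequence that an IUPAC motif matches."""
    choices = [IUPAC[symbol] for symbol in motif.upper()]
    return ["".join(letters) for letters in itertools.product(*choices)]


def find_occurrences(motif, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    body = "".join("[" + IUPAC[symbol] + "]" for symbol in motif.upper())
    # A lookahead lets the regex report overlapping occurrences.
    pattern = re.compile("(?=(" + body + "))")
    rna = sequence.upper().replace("T", "U")
    return [match.start() for match in pattern.finditer(rna)]


print(expand_motif("GGYMA"))  # ['GGCAA', 'GGCCA', 'GGUAA', 'GGUCA']
print(find_occurrences("GGYMA", "AAGGCAAGGUCAGGGAA"))  # [2, 7]
```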


Structure analysis

We scan the bound and unbound sequences with the IUPAC motif and identify the occurrences of the motif. We then extract the structure profiles of these motif occurrences and investigate whether the RBP of interest displays a specific secondary structure preference by comparing the profiles of the matches in bound sequences with those in unbound sequences.
To summarize the structure profile of an occurrence, we calculate, for each structural context, the average probability across the positions of the motif. The number of values depends on the selected representation:
  • PU: two values, i.e., the average probability of being in a paired region and in an unpaired region.
  • PLE: three values, i.e., the average probability of being in a paired region, in a hairpin loop, internal loop or multiloop, and in an external loop.
  • PHTE: four values, i.e., the average probability of being in a paired region, in a hairpin loop, in an internal loop or multiloop, and in an external loop.
  • PHIME: five values, i.e., the average probability of being in a paired region, in a hairpin loop, in an internal loop, in a multiloop, and in an external loop.
Once we have calculated these values for each motif occurrence, we plot the mean and standard error of their distributions over the occurrences in bound and in unbound sequences. We then use the Wilcoxon rank-sum test to check, for each structural context, whether the difference between the two distributions is significant. We report the result of this test together with the Bonferroni-corrected p-value. A recent application of this analysis to LIN28 CLIP data can be found in [4].
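The summary statistics used in this analysis can be sketched in a few lines of Python. All function names are hypothetical. The sketch assumes that the profile is stored as one list of probabilities per structural context and that the Bonferroni correction multiplies each p-value by the number of contexts tested; the rank-sum test itself is not implemented here (a routine such as scipy.stats.ranksums could supply the raw p-values).

```python
import math


def occurrence_summary(profile, start, motif_len):
    """Average probability of each context over the positions of one occurrence."""
    return {
        context: sum(probs[start:start + motif_len]) / motif_len
        for context, probs in profile.items()
    }


def mean_and_sem(values):
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, float("nan")
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    k = len(p_values)
    return {context: min(1.0, p * k) for context, p in p_values.items()}


# Hypothetical PU profile of a 4-nt sequence with a 2-nt motif match at position 1.
profile = {"P": [0.2, 0.4, 0.6, 0.8], "U": [0.8, 0.6, 0.4, 0.2]}
print(occurrence_summary(profile, 1, 2))
print(mean_and_sem([0.2, 0.5, 0.8]))
print(bonferroni({"P": 0.01, "U": 0.6}))
```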


Interpreting the results

  • De novo motif finding

    RNAcontext fits two sets of parameters: sequence parameters and structure parameters. Sequence parameters are converted into a Position Frequency Matrix (PFM) and displayed as a frequency motif logo created with the enoLOGOS software [8]. You can check the enoLOGOS documentation for more information on frequency logos. In short, the height of a symbol at a position is equal to the corresponding entry in the PFM. The optimal motif length is determined by the cross-validation area under the ROC curve (AU-ROC): the motif length that gives the highest cross-validation AU-ROC is selected.



    [Figure: Example logo]

    Position Frequency Matrix

    Position   A       C       G       U
    1          0.015   0.9     0.084   0.001
    2          0.073   0.020   0       0.907
    3          0       0.033   0.96    0.007
    4          0.016   0.047   0.934   0.003
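    To make the PFM concrete, the Python sketch below stores the example matrix and reads off the most frequent nucleotide at each position. The variable and function names are hypothetical, and the consensus string is only a convenient summary; it is not something the server reports.

```python
# Example Position Frequency Matrix from above, one dictionary per position.
PFM = [
    {"A": 0.015, "C": 0.9, "G": 0.084, "U": 0.001},
    {"A": 0.073, "C": 0.020, "G": 0.0, "U": 0.907},
    {"A": 0.0, "C": 0.033, "G": 0.96, "U": 0.007},
    {"A": 0.016, "C": 0.047, "G": 0.934, "U": 0.003},
]


def check_rows(pfm, tolerance=1e-6):
    """Verify that the frequencies at every position sum to 1."""
    return all(abs(sum(row.values()) - 1.0) <= tolerance for row in pfm)


def consensus(pfm):
    """Most frequent nucleotide at each position, i.e. the tallest logo symbol."""
    return "".join(max(row, key=row.get) for row in pfm)


print(check_rows(PFM))  # True
print(consensus(PFM))   # CUGG
```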


    Secondary structure parameters are scaled such that the value of the most preferred context is equal to 1. The bar graph displays the data shown in the table below. In this example, the RBP of interest is predicted to prefer hairpin loops the most. In addition, paired and multiloop regions are preferred over internal and external loops.


    P       H     I       M       E
    0.702   1.0   0.052   0.601   0.140















    To assess the predictive performance of the learned model, we apply 5-fold cross-validation and calculate the area under the ROC curve (AU-ROC) for each of the five runs. To measure the added predictive value of structure preferences, we score the sequences in two ways: using only the sequence parameters, and using both the sequence and structure parameters. We also calculate p-values for the AU-ROCs as described in [9]. The table below shows an example of this analysis.
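    The evaluation can be illustrated with the Python sketch below, which computes an AU-ROC from scores of bound and unbound sequences and the standard error formula of Hanley and McNeil [9]. The function names are hypothetical, and the exact p-value procedure used by the server is not specified here: the one-sided normal test against an AU-ROC of 0.5, with the standard error evaluated at 0.5, is an assumption made for illustration.

```python
import math


def auc(bound_scores, unbound_scores):
    """AU-ROC as the fraction of (bound, unbound) pairs ranked correctly."""
    wins = 0.0
    for b in bound_scores:
        for u in unbound_scores:
            if b > u:
                wins += 1.0
            elif b == u:
                wins += 0.5  # ties count as half
    return wins / (len(bound_scores) * len(unbound_scores))


def hanley_mcneil_se(a, n_bound, n_unbound):
    """Standard error of an AU-ROC value a (Hanley and McNeil, 1982)."""
    q1 = a / (2.0 - a)
    q2 = 2.0 * a * a / (1.0 + a)
    variance = (
        a * (1.0 - a)
        + (n_bound - 1) * (q1 - a * a)
        + (n_unbound - 1) * (q2 - a * a)
    ) / (n_bound * n_unbound)
    return math.sqrt(variance)


def auc_p_value(a, n_bound, n_unbound):
    """One-sided p-value for AU-ROC > 0.5 (assumed test, normal approximation)."""
    se_null = hanley_mcneil_se(0.5, n_bound, n_unbound)
    z = (a - 0.5) / se_null
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical scores for one cross-validation fold.
a = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
print(a, hanley_mcneil_se(a, 3, 3), auc_p_value(a, 3, 3))
```

    Scoring the same held-out sequences once with the sequence-only model and once with the full model, and comparing the two AU-ROC values per fold, gives the comparison described above.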


    As a follow-up, we check existing databases of RBP binding sites to identify RBPs whose motifs are similar to the predicted sequence motif. We use the TomTom tool with the parameters -thresh 0.9 -dist Pearson and display the 5 most similar RBPs. The table below shows the results of an example run.


  • Structure analysis
The results of the structure analysis are displayed as an error bar graph. For each structural context, this graph shows the mean and standard error of the distribution of profile values for motif occurrences in bound and in unbound sequences. A Wilcoxon rank-sum test is used to identify the structural contexts for which there is a statistically significant difference between occurrences in bound and unbound sequences.

If you have any questions please send an email to hilalkazan at gmail dot com.

References


[1] Kazan H, Ray D, Chan ET, Hughes TR, Morris Q (2010) RNAcontext: A new method for learning the sequence and structure binding preferences of RNA-Binding Proteins. PLoS Comput Biol 6(7): e1000832. doi:10.1371/journal.pcbi.1000832
[2] Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, et al. (1994) Fast folding and comparison of RNA secondary structures. Monatsh Chem 125:167-188.
[3] Lange SJ, Maticzka D, Mohl M, Gagnon JN, Brown CM, et al. (2012) Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic Acids Res. doi:10.1093/nar/gks181
[4] Wilbert ML, Huelga SC, Kapelli K, Stark TJ, Liang TY, et al. (2012) LIN28 binds messenger RNAs at GGAGA motifs and regulates splicing factor abundance. Mol Cell 48(2):195-206.
[5] Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS (2007) Quantifying similarity between motifs. Genome Biol 8:R24.
[6] Cook KB, Kazan H, Zuberi K, Morris Q, Hughes TR (2011) RBPDB: a database of RNA-binding specificities. Nucleic Acids Res 39:D301-D308.
[7] Ray D et al. (2013) A compendium of RNA-binding motifs for analysis of gene regulation. Nature (in press).
[8] Workman CT, Yin Y, Corcoran DL, Ideker T, Stormo GD, Benos PV (2005) enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res 33:W389-W392.
[9] Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29-36.