RNA-binding proteins (RBPs) play a crucial role in gene regulation. Unfortunately, experimental approaches for detecting RNA-protein binding sites on RNAs are still costly and time-consuming. Computational methods are therefore needed to detect binding sites automatically from sequences. This usually requires an efficient way to represent sequences of varying length as vectors of the same size. For example, representing k-mers with one-hot encoding is a simple and widely used approach in RBP binding site prediction. However, the k-mer feature representation leads to extremely high-dimensional and sparse feature vectors. Furthermore, it ignores positional information within the sequences, which could negatively impact its predictive power. In this study, we present iDeepV, a deep learning based approach. It first applies an unsupervised shallow two-layer neural network to automatically learn a distributed representation of k-mers by considering their neighboring context. Compared to the conventional k-mer approach, the new distributed representation captures the latent relationships among k-mers, so that the similarity between k-mers is taken into account. The learned distributed representations of the input sequences are then used as inputs to a convolutional neural network (CNN) that discriminates RBP-bound sites from unbound sites. We comprehensively evaluate iDeepV on two large-scale RBP binding site datasets. The results show that iDeepV achieves performance comparable to that of state-of-the-art methods. The iDeepV algorithm is available at https://github.com/xypan1232/iDeepV.
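
To make the preprocessing idea concrete, the sketch below shows one way an RNA sequence of arbitrary length could be split into overlapping k-mers and turned into a fixed-length list of integer ids, which an embedding layer followed by a CNN could then consume. It is only an illustration and not the iDeepV implementation: the function names, the choice k = 3, the stride, the reserved padding id 0, and the handling of ambiguous bases are all assumptions made here.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; T is converted to U below


def build_vocab(k):
    """Map every possible k-mer over ACGU to an integer id (0 is reserved for padding)."""
    return {"".join(p): i + 1 for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmers(seq, k, stride=1):
    """Split a sequence into overlapping k-mers, the 'words' of the sequence."""
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq, vocab, k, max_len, stride=1):
    """Turn a sequence into exactly max_len ids by truncating or zero-padding.

    k-mers that are not in the vocabulary (for example, ones containing N)
    are mapped to 0 here; that is a simplification chosen for this sketch.
    """
    ids = [vocab.get(km, 0) for km in kmers(seq, k, stride)][:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = build_vocab(3)               # 4**3 = 64 possible 3-mers
print(kmers("ACGUA", 3))             # the three overlapping 3-mers
print(encode("ACGUA", vocab, 3, 5))  # three ids followed by two padding zeros
```

Each id would index a row of an embedding matrix, so a sequence becomes a dense matrix of shape (max_len, embedding dimension) rather than a sparse one-hot vector over all k-mers. In a setting like the one described above, that matrix would be learned from the neighboring context of k-mers, for instance with a word2vec-style model.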

doi.org/10.1016/j.neucom.2018.04.036, hdl.handle.net/1765/106493
Neurocomputing
Department of Medical Informatics

Pan, X., & Shen, H.-B. (2018). Learning distributed representations of RNA sequences and its application for predicting RNA-protein binding sites with a convolutional neural network. Neurocomputing, 305, 51–58. doi:10.1016/j.neucom.2018.04.036