Deep learning algorithms for the prediction of non-coding RNAs in bladder cancer

When:
17/05/2021 – 18/05/2021 all-day

Offer related to the Action/Network: -/-

Laboratory/Company: IBISC Laboratory, Université d’Evry / Université
Duration: 3 years
Contact: fariza.tahi@univ-evry.fr
Publication deadline: 2021-05-17

Context:
Development of computational methods for the study of non-coding RNAs involved in bladder cancer
In recent years, machine learning methods, particularly deep learning, have grown considerably and have proven effective in many fields, including biology and medicine. A growing number of bioinformatics methods and tools based on deep learning are being proposed in the literature to answer various biological and biomedical questions. In this project, we want to propose new deep learning methods for the prediction and analysis of a particular kind of genomic sequence, the non-coding RNAs (ncRNAs), in a biomedical context: their involvement in bladder cancer.
Non-coding RNAs, which do not code for proteins and constitute the largest part of genomes, are increasingly identified as playing important roles in the deregulation processes that lead to pathologies such as cancer (Anastasiadou et al., 2018). They are thus considered potential diagnostic markers and therapeutic targets. Identifying them and determining their function are important issues. With next-generation sequencing (NGS), which generates considerable volumes of omics data, their prediction and characterization by in silico methods are essential to guide experimental studies.
Recently, long ncRNAs (lncRNAs), which are longer than 200 nucleotides, have been identified as potential regulators. Unlike small ncRNAs, however, their characterization and classification by structure and function are far from established. Determining the structure of a lncRNA is a difficult problem, both by experimental methods (crystallography, NMR) and by bioinformatic methods. Determining its function is even more difficult, especially since, unlike proteins, ncRNAs with similar functions often lack sequence homology (RNA sequences show compensatory mutations that maintain structural conservation). Several classifications of lncRNAs have been proposed, based on different criteria: transcript length, location, and association with protein-coding genes. These classifications are summarized in (St. Laurent et al., 2015). In (Kopp and Mendell, 2018), the authors suggest studying lncRNAs according to their location, explaining that location is often linked to function. However, the large majority of published work is dedicated to the study of one specific lncRNA. For instance, a recent study (Uroda et al., 2019) reveals the importance of a pseudoknot (a particular motif of the secondary structure) in the mechanism by which the MEG3 lncRNA regulates the biological pathway of p53, a gene involved in many cancers.

From a computational point of view, a few methods have been proposed in the literature for the classification of well-characterized ncRNAs whose structure is well known. These methods rely on supervised machine learning, often deep learning, and build their models on a dataset comprising 13 classes of small ncRNAs. We can cite RNAcon (Panwar et al., 2014), based on random forests, and nRC (Fiannaca et al., 2017), based on convolutional neural networks (CNNs), both of which use the secondary structure for classification; and, more recently, ncRDeep (Chantsalnyam et al., 2020), based on CNNs, and ncRFP (Wang et al., 2020), based on recurrent neural networks (RNNs), both of which consider only sequence characteristics. Very few methods specifically address the classification of lncRNAs. For example, SEEKR (Kirk et al., 2018) uses the sequence, more precisely k-mer profiles, to group the most similar transcripts into a functional class, using a clustering algorithm based on Pearson correlation (unsupervised learning). LncADeep (Yang et al., 2018) uses a deep neural network (DNN) to identify interactions between lncRNAs and proteins, based on sequence and secondary structure. The tool then uses the annotation of the proteins associated with a lncRNA to describe the biological functions in which it is potentially involved. Although these methods make it possible to specify the broad category of a lncRNA, they remain limited. In addition, not all the classes summarized in (St. Laurent et al., 2015) are identified by the existing tools. We believe that lncRNAs could be classified more finely by taking other characteristics into account.
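To make the idea of comparing transcripts by k-mer content concrete, here is a minimal sketch that builds normalized k-mer profiles and compares them with Pearson correlation. It is not the SEEKR implementation: the function names, the toy sequences and the choice of k = 3 are assumptions made for illustration, and any normalization against a background set of transcripts is omitted.

```python
from itertools import product
from math import sqrt

def kmer_profile(seq, k=3, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Toy transcripts, made up for illustration (not real lncRNAs).
transcripts = {
    "toy_A": "AUGGCUAUGGCUAUGGCUAUGGCU",
    "toy_B": "AUGGCUAUGGCAAUGGCUAUGGCU",  # toy_A with one substitution
    "toy_C": "CCCCGGGGCCCCGGGGCCCCGGGG",  # unrelated composition
}
profiles = {name: kmer_profile(s, k=3) for name, s in transcripts.items()}
sim_ab = pearson(profiles["toy_A"], profiles["toy_B"])
sim_ac = pearson(profiles["toy_A"], profiles["toy_C"])
print(f"A vs B: {sim_ab:.3f}   A vs C: {sim_ac:.3f}")
```

A matrix of such pairwise correlations could then be passed to a clustering algorithm, so that transcripts with similar k-mer content fall into the same group even when they cannot be aligned.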

In this project, we propose to develop original computational methods based on deep learning (DL) to predict, classify and identify the function of ncRNAs, including lncRNAs, by integrating different characteristics: sequence, structure (especially secondary structure), genomic and chromosomal position, interaction with coding or non-coding genes, and genetic and epigenetic alterations. Two methodological challenges must be addressed: (i) handling heterogeneous characteristics (multi-source approach); (ii) predicting known classes of ncRNAs while remaining able to predict new classes, by combining a supervised approach with an unsupervised one. We also consider the visualization of results an important point, so that users can better understand and interpret them.
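Challenge (ii) can be pictured with a classifier that abstains when it is not confident enough, so that abstained inputs can be set aside as candidates for new classes. The sketch below uses a simple threshold on the softmax probability, which is only one possible reject rule and not necessarily the one the project will adopt. The class names, the scores and the threshold value are hypothetical.

```python
from math import exp

def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_reject(scores, classes, threshold=0.7):
    """Return the most probable known class, or None when confidence is too low.

    The threshold is an arbitrary illustrative value, not a recommended setting.
    """
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best] if probs[best] >= threshold else None

# Hypothetical class names and raw scores, chosen only for illustration.
classes = ["miRNA", "tRNA", "snoRNA"]
confident = predict_with_reject([4.0, 0.5, 0.2], classes)  # one class dominates
ambiguous = predict_with_reject([1.0, 0.9, 1.1], classes)  # no clear winner

# Rejected inputs are kept aside; an unsupervised method could group them later.
rejected_pool = [] if ambiguous is not None else ["candidate for a new class"]
print(confident, ambiguous, rejected_pool)
```

In this picture the supervised part handles the known classes, while the pool of rejected inputs is what an unsupervised method would explore to propose new ones.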

Subject:
We are interested in self-organizing maps (SOMs), which are unsupervised neural networks capable of grouping and visualizing large-scale data. Using an unsupervised competitive learning algorithm, this technique produces a map representing the input space, in which similar data points are placed in neighboring regions of the map. To represent heterogeneous sources, we will propose original multimodal approaches based on DL that merge the different data sources. Fusion can be performed using three main strategies (Ramachandram and Taylor, 2017): early fusion, joint fusion and late fusion. Early fusion combines the input features of the different sources before applying a single DL model. Joint fusion combines the representations of the inputs learned at the intermediate layers of different neural networks, one per modality. Late fusion combines the decisions of several neural networks, each processing one modality, to produce a final decision. We will be particularly interested in joint fusion for the classification of ncRNAs and the identification of their biological functions. To take the heterogeneous sources into account, each data source will be processed by a suitable DL model, such as convolutional neural networks (CNNs), graph neural networks (GNNs) and multi-layer perceptrons (MLPs), which will allow better extraction of high-level features from that source. To allow the discovery of new classes, we will study how to associate different rejection options (Geifman and El-Yaniv, 2019) with the multimodal model. Combining this model with SOMs (Platon et al. 2018) will allow new classes of ncRNAs to be visualized. We will also be interested in identifying the data sources and the characteristics that led to the predictions (Platon et al. 2018bis). This will make it possible to explain the predictions and to discover new properties that could be associated with ncRNAs.
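The difference between the three fusion strategies lies in where the data sources are merged. The following sketch shows only that merge point, using small untrained linear layers with random weights as stand-ins for the real per-source models (CNNs, GNNs, MLPs). All names, feature vectors and layer sizes are assumptions made for illustration, and no training is performed.

```python
import random
from math import exp

random.seed(0)  # fixed seed so the untrained weights are reproducible

def make_dense(n_in, n_out, relu=True):
    """Build an untrained linear layer with random weights, optionally with ReLU."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    def layer(x):
        z = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        return [max(0.0, v) for v in z] if relu else z
    return layer

def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

N_CLASSES = 3
# Hypothetical feature vectors for one ncRNA, one vector per data source.
sequence_features = [0.2, 0.1, 0.4, 0.3]   # e.g. k-mer frequencies
structure_features = [1.0, 0.0, 0.5]       # e.g. secondary-structure descriptors

# Early fusion: concatenate the raw inputs, then apply a single model.
early_model = make_dense(4 + 3, N_CLASSES, relu=False)
early_probs = softmax(early_model(sequence_features + structure_features))

# Joint fusion: one encoder per source, concatenate the learned representations,
# then apply a shared head. In a real model all three parts are trained together.
seq_encoder = make_dense(4, 5)
struct_encoder = make_dense(3, 5)
joint_head = make_dense(5 + 5, N_CLASSES, relu=False)
fused = seq_encoder(sequence_features) + struct_encoder(structure_features)
joint_probs = softmax(joint_head(fused))

# Late fusion: one complete classifier per source, then combine the decisions
# (here by averaging the class probabilities).
seq_classifier = make_dense(4, N_CLASSES, relu=False)
struct_classifier = make_dense(3, N_CLASSES, relu=False)
p_seq = softmax(seq_classifier(sequence_features))
p_struct = softmax(struct_classifier(structure_features))
late_probs = [(a + b) / 2 for a, b in zip(p_seq, p_struct)]

print(early_probs, joint_probs, late_probs, sep="\n")
```

Because the weights are random, the printed probabilities carry no meaning; the point is that joint fusion lets each source keep its own encoder while the classification head still sees all sources at once.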
Expected results:
Application, objectives and relevance to cancer research
The deregulation of ncRNAs may participate in tumor progression but the ncRNAs involved and their roles remain poorly defined. Clinically, ncRNAs can be diagnostic markers and therapeutic targets (Roberts et al., 2020).
Cancer in a given tissue is a heterogeneous disease composed of several subtypes, each defined by a specific transcriptional program. Genetic and epigenetic alterations, as well as the genes involved in cancer, must be studied taking these subtypes into account. In this project we will focus on bladder cancers (Tran et al., 2021), and in particular on the papillary luminal and basal subtypes, for which the partner team has been a major contributor. Luminal papillary cancers are well differentiated and present, in the majority of cases, activating genetic alterations of the gene coding for the receptor tyrosine kinase FGFR3 and activation of the nuclear receptor PPARG (Biton et al., 2014; Mahé et al., 2018; Rochel et al., 2019; Shi et al., 2020). The basal subtype is particularly aggressive (most deaths occur within one year of diagnosis), poorly differentiated, and found not only in bladder cancers but also in many other carcinomas (breast, pancreatic and lung cancers, for example) (Rebouissou et al., 2014; Kamoun et al., 2019).
The goals of the project are:
1) To systematically identify and classify the ncRNAs present in tumors of the papillary luminal and basal subtypes and compare them to the ncRNAs present in normal urothelium in different physiological states: different stages of differentiation, development or healing. This will allow us to accurately compare tumor cells to normal cells.
2) To identify deregulated ncRNAs in tumor cells compared to normal cells.
3) To determine the genetic or epigenetic mechanisms of their deregulation.
4) To predict the biological functions in which these ncRNAs are involved and their roles (target genes in the case of small ncRNAs, sponge function, participation in complexes and/or regulation of transcription in the case of long ncRNAs).
For this purpose, we will use the computational tools developed in this project, other RNA bioinformatics tools developed in the AROBAS team and available on the EvryRNA platform (http://EvryRNA.ibisc.univ-evry.fr), such as RNANet (Becquey et al., 2020), BiORSEO (Becquey et al., 2020), RCPred (Legendre et al., 2019), IRSOM (Platon et al., 2018), miRNAFold (Tav et al., 2016) and miRBoost (Tran et al., 2015), as well as tools proposed in the literature.
The project will initially take advantage of molecular data that are already available (acquired by the partner team or public): transcriptomics on whole samples and on single cells, genomic alterations, mutations, DNA methylation, histone modifications, DNA accessibility (ATAC-seq) and DNA conformation (Hi-C). These data will be associated with clinical and pathological data. During the project, additional data will be acquired to complement the single-cell analyses and take into account the spatial organization of the tumor (spatial transcriptomics).

The proposed work will advance our knowledge of bladder cancer, a cancer for which therapeutic options remain limited in the most aggressive forms. Since the subtypes studied are found in other cancers, the biological results obtained will have a general scope in oncology. RNAs are promising therapeutic targets; they are also diagnostic markers that can be very specific. Beyond developing original deep learning algorithms and bioinformatics methods dedicated to RNAs, this thesis project therefore aims to use these methods to help analyze and understand a health issue, here bladder cancer, with a view to a better therapeutic response.

References:
-Anastasiadou E, Jacob LS, Slack FJ. Non-coding RNA networks in cancer. Nat Rev Cancer. 2018 Jan;18(1):5-18. doi: 10.1038/nrc.2017.99. Epub 2017 Nov 24. PMID: 29170536.
-Becquey L, Angel E, Tahi F. RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures. Bioinformatics. 2020 Nov 2:btaa944. doi: 10.1093/bioinformatics/btaa944.
-Becquey L, Angel E, Tahi F. BiORSEO: a bi-objective method to predict RNA secondary structures with pseudoknots using RNA 3D modules. Bioinformatics. 2020 Apr 15;36(8):2451-2457. doi: 10.1093/bioinformatics/btz962. PMID: 31913439.
-Biton A, Bernard-Pierrot I, Lou Y, Krucker C, Chapeaublanc E, Rubio-Pérez C, López-Bigas N, Kamoun A, Neuzillet Y, Gestraud P, Grieco L, Rebouissou S, de Reyniès A, Benhamou S, Lebret T, Southgate J, Barillot E, Allory Y, Zinovyev A, Radvanyi F. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep. 2014 Nov 20;9(4):1235-45. doi: 10.1016/j.celrep.2014.10.035. Epub 2014 Nov 13. PMID: 25456126.
-Chantsalnyam, T., Lim, D. Y., Tayara, H., & Chong, K. T. (2020). ncRDeep: Non-coding RNA classification with convolutional neural network. In Computational Biology and Chemistry (Vol. 88). Elsevier Ltd. https://doi.org/10.1016/j.compbiolchem.2020.107364
-Fiannaca, A., La Rosa, M., La Paglia, L., Rizzo, R., Urso, A. (2017). nRC: non-coding RNA Classifier based on structural features. BioData Mining, 10(1), 27. https://doi.org/10.1186/s13040-017-0148-2
-Geifman, Y. & El-Yaniv, R.. (2019). SelectiveNet: A Deep Neural Network with an Integrated Reject Option. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:2151-2159
-Kamoun A, de Reyniès A, Allory Y, Sjödahl G, Robertson AG, Seiler R, Hoadley KA, Groeneveld CS, Al-Ahmadie H, Choi W, Castro MAA, Fontugne J, Eriksson P, Mo Q, Kardos J, Zlotta A, Hartmann A, Dinney CP, Bellmunt J, Powles T, Malats N, Chan KS, Kim WY, McConkey DJ, Black PC, Dyrskjøt L, Höglund M, Lerner SP, Real FX, Radvanyi F; Bladder Cancer Molecular Taxonomy Group. A Consensus Molecular Classification of Muscle-invasive Bladder Cancer. Eur Urol. 2020 Apr;77(4):420-433. doi: 10.1016/j.eururo.2019.09.006. Epub 2019 Sep 26. PMID: 31563503.
-Kirk, J. M., Kim, S. O., Inoue, K., Smola, M. J., Lee, D. M., Schertzer, M. D., Wooten, J. S., Baker, A. R., Sprague, D., Collins, D. W., Horning, C. R., Wang, S., Chen, Q., Weeks, K. M., Mucha, P. J., Calabrese, J. M. (2018). Functional classification of long non-coding RNAs by k-mer content. Nature Genetics, 50(10), 1474–1482. https://doi.org/10.1038/s41588-018-0207-8
-Kopp, F., & Mendell, J. T. (2018). Functional Classification and Experimental Dissection of Long Noncoding RNAs. In Cell (Vol. 172, Issue 3, pp. 393–407). https://doi.org/10.1016/j.cell.2018.01.011
-Legendre A, Angel E, Tahi F. RCPred: RNA complex prediction as a constrained maximum weight clique problem. BMC Bioinformatics. 2019 Mar 29;20(Suppl 3):128. doi: 10.1186/s12859-019-2648-1
-Mahé M, Dufour F, Neyret-Kahn H, Moreno-Vega A, Beraud C, Shi M, Hamaidi I, Sanchez-Quiles V, Krucker C, Dorland-Galliot M, Chapeaublanc E, Nicolle R, Lang H, Pouponnot C, Massfelder T, Radvanyi F, Bernard-Pierrot I. An FGFR3/MYC positive feedback loop provides new opportunities for targeted therapies in bladder cancers. EMBO Mol Med. 2018 Apr;10(4):e8163. doi: 10.15252/emmm.201708163. PMID: 29463565.
-Pachera E, Assassi S, Salazar GA, Stellato M, Renoux F, Wunderlin A, Blyszczuk P, Lafyatis R, Kurreeman F, de Vries-Bouwstra J, Messemaker T, Feghali-Bostwick CA, Rogler G, van Haaften WT, Dijkstra G, Oakley F, Calcagni M, Schniering J, Maurer B, Distler JH, Kania G, Frank-Bertoncelj M, Distler O. Long noncoding RNA H19X is a key mediator of TGF-β-driven fibrosis. J Clin Invest. 2020 Sep 1;130(9):4888-4905. doi: 10.1172/JCI135439. PMID: 32603313.
-Panwar, B., Arora, A., & Raghava, G. P. S. (2014). Prediction and classification of ncRNAs using structural information. BMC Genomics, 15(1), 127. https://doi.org/10.1186/1471-2164-15-127
-Platon L, Zehraoui F, Bendahmane A, Tahi F. IRSOM, a reliable identifier of ncRNAs based on supervised self-organizing maps with rejection. Bioinformatics. 2018 Sep 1;34(17):i620-i628. doi: 10.1093/bioinformatics/bty572.
-Platon L, Zehraoui F, Tahi F. Localized Multiple Sources Self-Organizing Map. ICONIP (3) 2018: 648-659.
-Ramachandram D. and Taylor, G. W. ‘Deep Multimodal Learning: A Survey on Recent Advances and Trends,’ in IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 96-108, Nov. 2017, doi: 10.1109/MSP.2017.2738401.
-Rebouissou S, Bernard-Pierrot I, de Reyniès A, Lepage ML, Krucker C, Chapeaublanc E, Hérault A, Kamoun A, Caillault A, Letouzé E, Elarouci N, Neuzillet Y, Denoux Y, Molinié V, Vordos D, Laplanche A, Maillé P, Soyeux P, Ofualuka K, Reyal F, Biton A, Sibony M, Paoletti X, Southgate J, Benhamou S, Lebret T, Allory Y, Radvanyi F. EGFR as a potential therapeutic target for a subset of muscle-invasive bladder cancers presenting a basal-like phenotype. Sci Transl Med. 2014 Jul 9;6(244):244ra91. doi: 10.1126/scitranslmed.3008970. PMID: 25009231.
-Roberts TC, Langer R, Wood MJA. Advances in oligonucleotide drug delivery. Nat Rev Drug Discov. 2020 Oct;19(10):673-694. doi: 10.1038/s41573-020-0075-7. Epub 2020 Aug 11. PMID: 32782413
-Rochel N., Krucker C.*, Coutos-Thevenot L.*, Osz J., Zhang R., Guyon E., Zita W., Vanthong S., Alba Hernandez O., Bourguet M., Al Badawy K., Dufour F., Peluso-Iltis C., Heckler-Beji S., Dejaegere A., Kamoun A., de Reyniès A., Neuzillet Y., Rebouissou S., Béraud C., Lang H., Massfelder T., Allory Y., Cianférani S., Stote R.H., Radvanyi F., Bernard-Pierrot I. (2019). Recurrent activating mutations of PPARγ associated with luminal bladder tumors. Nat. Commun. 10, 253.
-Shi MJ, Meng XY, Fontugne J, Chen CL, Radvanyi F, Bernard-Pierrot I. Identification of new driver and passenger mutations within APOBEC-induced hotspot mutations in bladder cancer. Genome Med. 2020 Sep 28;12(1):85. doi: 10.1186/s13073-020-00781-y. PMID: 32988402.
-St. Laurent, G., Wahlestedt, C., & Kapranov, P. (2015). The Landscape of long noncoding RNA classification. In Trends in Genetics (Vol. 31, Issue 5, pp. 239–251). Elsevier Ltd. https://doi.org/10.1016/j.tig.2015.03.007
-Tav C, Tempel S, Poligny L, Tahi F. miRNAFold: a web server for fast miRNA precursor prediction in genomes. Nucleic Acids Res. 2016 Jul 8;44(W1):W181-4.
-Tran VD, Tempel S, Zerath B, Zehraoui F, Tahi F. miRBoost: boosting support vector machines for microRNA precursor classification. RNA. Vol. 21, No. 5, 2015.
-Tran L, Xiao JF, Agarwal N, Duex JE, Theodorescu D. Advances in bladder cancer biology and therapy. Nat Rev Cancer. 2021 Feb;21(2):104-121. doi: 10.1038/s41568-020-00313-1. Epub 2020 Dec 2. PMID: 33268841.
-Uroda, T., Anastasakou, E., Rossi, A., Teulon, J. M., Pellequer, J. L., Annibale, P., Pessey, O., Inga, A., Chillón, I., Marcia, M. (2019). Conserved Pseudoknots in lncRNA MEG3 Are Essential for Stimulation of the p53 Pathway. Molecular Cell, 75(5), 982-995.e9. https://doi.org/10.1016/j.molcel.2019.07.025
-Wang, L., Zheng, S., Zhang, H., Qiu, Z., Zhong, X., Liu, H., Liu, Y. (2020). ncRFP: A novel end-to-end method for non-coding RNAs family prediction based on Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1–1. https://doi.org/10.1109/tcbb.2020.2982873
-Yang, C., Yang, L., Zhou, M., Xie, H., Zhang, C., Wang, M. D., Zhu, H. (2018). LncADeep: An ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics, 34(22), 3825–3834. https://doi.org/10.1093/bioinformatics/bty428.

Candidate profile:
Candidates with a Master's degree (M2 or equivalent) in bioinformatics, computer science or data science.

Training and skills required:
Knowledge of biology will make it easier to interact with the biologists involved in the project. Some of this knowledge can also be acquired during the research work. Strong adaptability (new methods, new topics) and a desire to interact with people from different specialties are required.

Work address:
Bâtiment IBGBI. 23 bv. de France. 91000 Evry.