
MaDICS is a CNRS Research Group (Groupement de Recherche, GDR) created in 2015. It provides an ecosystem to promote and coordinate interdisciplinary research activities in Data Science. It is a forum for exchange and support for scientific and non-scientific actors (industry, media, culture, …) facing the problems of Big Data and Data Science.
Learn more…
MaDICS activities are structured through Actions and Workshops. Actions bring together the actors of a specific theme for a limited period (between two and four years). The creation of an Action is preceded by one or more Workshops that consolidate the themes and objectives of the future Action.
The MaDICS website offers several support and communication tools open to the community concerned with Data Science:
- MaDICS Events: the GDR MaDICS endorses events such as conferences, workshops and summer schools. Every endorsement request is evaluated by the GDR Steering Committee. An endorsement makes financial support available for young researchers. It can also be accompanied by a request for financial support for travel by speakers or participants at the event. Learn more…
- MaDICS Networks: to better target research animation activities related to training and innovation, the GDR MaDICS has set up a Training Network aimed at various audiences (young researchers, continuing education, …), an Innovation Network to facilitate and intensify the dissemination of Big Data and Data Science research to industrial actors, and a Partner Club whose members support and participate in the GDR's activities. Learn more…
- Doctoral Students' Space: doctoral students and young researchers are an essential driving force of research, and the GDR offers grants for mobility and for participation in MaDICS events. Learn more…
- Communication tools: the MaDICS website disseminates various information (events, job offers, thesis proposals, …) related to the GDR's research themes. This information is sent to all subscribers of the MaDICS mailing list and published in a public Calendar (events) and on a job offers page. Learn more…
MaDICS membership: membership in the GDR MaDICS is free for members of public research laboratories and institutions. Other people may join on behalf of their company or as individuals by paying an annual fee.
Learn more…
Upcoming events
Days, Schools, Conferences and Seminars
Actions, Workshops and Working Groups:
CODA DAE DatAstro DSChem EXMIA GINO GRASP RECAST SaD-2HN SIMDAC SimpleText TIDS
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: Laboratoire Hubert Curien
Duration: 36 months
Contact: ievgen.redko@univ-st-etienne.fr
Application deadline: 2021-04-15
Context:
In recent years, deep learning has established itself as the state-of-the-art ML method in many real-world tasks, such as computer vision and natural language processing, to name a few [1]. While achieving impressive performance in practice, training DNNs requires optimizing a non-convex non-concave objective function, even in the case of linear activation functions, and can potentially lead to local minima that are arbitrarily far from the global minimum. This, however, is not the typical behaviour observed in practice, as several works [2, 3] showed empirically that even when training state-of-the-art convolutional or fully-connected feedforward neural networks one does not converge to suboptimal local minima. Such mysterious behaviour has made studying the loss surface of DNNs and characterizing their local minima a topic of high scientific importance for the ML community.
Topic:
In order to provide novel insights into the behaviour of DNNs, our goal will be to study them as instances of congestion games [4], a class of games often used to model network traffic and communications. This particular choice is due, on the one hand, to the fact that both DNNs and congestion games can be modeled as directed acyclic graphs (DAGs), while, on the other, congestion games are arguably among the most studied classes of games in game theory and are known to exhibit many attractive properties. The approximate objectives of the Ph.D. thesis in this context will consist in:
1. Proposing a general approach of finding a congestion game equivalent to a given DNN.
2. Translating the different quantities of interest often analyzed in the context of congestion games to DNNs in order to provide a novel theoretical analysis for them.
3. Using extension theorems [5] to study the speed of convergence of online optimization strategies when applied to DNNs.
Some encouraging preliminary results for such an approach have recently been obtained by the supervisors, and the Ph.D. candidate will be expected to address the open problems mentioned in the corresponding paper.
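To make the DNN/congestion-game analogy concrete, here is a small, purely illustrative sketch (a hypothetical construction, not the supervisors' results): a feedforward network viewed as a DAG, with source-to-sink paths playing the role of the routes of a congestion game.

```python
from collections import deque

# Hypothetical toy: a feedforward network with layer widths [2, 2, 1],
# viewed as a directed acyclic graph, as in the congestion-game analogy.
layers = [2, 2, 1]
nodes, edges = [], {}
for l, width in enumerate(layers):
    nodes += [(l, i) for i in range(width)]
for l in range(len(layers) - 1):
    for i in range(layers[l]):
        for j in range(layers[l + 1]):
            edges.setdefault((l, i), []).append((l + 1, j))

def is_dag(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff a full topological order exists."""
    indeg = {n: 0 for n in nodes}
    for succs in edges.values():
        for v in succs:
            indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in edges.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == len(nodes)

def count_paths(node, edges, sink_layer):
    """Source-to-sink paths through `node` ~ routes of a congestion game."""
    if node[0] == sink_layer:
        return 1
    return sum(count_paths(v, edges, sink_layer) for v in edges.get(node, []))

assert is_dag(nodes, edges)
assert count_paths((0, 0), edges, 2) == 2  # each input unit has 2*1 = 2 routes
```

In a congestion game on this DAG, each path would be a player's strategy and each edge would carry a cost depending on how many paths use it; the thesis would seek a principled mapping of this kind for a given DNN.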
Candidate profile:
Ideal candidates will have a strong background in both machine learning and game theory, but anyone with a Master’s degree in applied or pure mathematics is encouraged to apply. Proficiency in at least one programming language commonly used in the machine learning community would be a plus.
Required training and skills:
Master’s degree in applied/pure mathematics or machine learning.
Job location:
Saint-Etienne, France
Attached document: 202103221643_thesis_proposal.pdf
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: INRIA Sophia-Antipolis / LERIA / DGA
Duration: 3 years
Contact: nicolas.gutowski@univ-angers.fr
Application deadline: 2021-04-15
Context:
Inria Sophia-Antipolis, the DGA (Direction générale de l’armement, Angers site) and the Université d’Angers (LERIA) are looking for a candidate for a thesis in the field of Artificial Intelligence applied to the safety analysis of the dynamic behaviour of vehicles.
Topic:
The DGA Techniques Terrestres centre on the Angers site carries out, among other things, track tests to analyse the safety of the dynamic behaviour of vehicles in order to validate their use in operation. These tests are performed by several drivers, who report a global score based on the evaluation of multiple criteria. The objective of the thesis is to integrate into these tests an artificial intelligence as a decision-support tool for the scoring.
Candidate profile:
Holder of at least an M2 or an engineering degree, and a citizen of the European Union, Switzerland or the United Kingdom.
Required training and skills:
The candidate should enjoy working in a multidisciplinary team and have a taste for field experimentation.
He or she has knowledge of instrumentation, sensors, algorithmics and artificial intelligence.
The candidate must be available to start the thesis on 01/10/2021. A one-month postponement is possible depending on the candidate's needs.
Job location:
1.5 years at Sophia-Antipolis and 1.5 years at Angers.
Attached document: 202103242150_2021_SujetTheseINRIA_LERIA_DGA.pdf
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: ERIC, Université de Lyon
Duration: 3 years
Contact: julien.jacques@univ-lyon2.fr
Application deadline: 2021-04-20
Context:
In many areas of humanities and social sciences, studies are based on questionnaires completed by participants. Often, these questionnaires are completed several times over the study period. The researchers then analyse these questionnaires to determine typical behaviours within the studied population. But the statistical analysis of these questionnaires is far from simple, for several reasons. First, the answers to the questions are often of different types: nominal categorical (for example “what is your socio-professional category?”), ordinal categorical (for example “what is your level of satisfaction: bad, average, good?”), quantitative (“what is your age?”), or textual (for open questions with free answers). The analysis of such mixed data is a current research problem in the fields of statistics and machine learning, and for lack of an existing solution the practitioner often tends to transform the data to standardize them. Such an approach is not satisfactory, since it leads either to the introduction of a bias or to a significant loss of information.
The second scientific obstacle is the modelling of the temporal evolution of the answers to the questions. Currently, the analyses are done independently at each temporal phase, and researchers then try a posteriori to find links between these different analyses by looking for similar typical behaviours from one phase to the next. The ideal way to handle these data would be to propose a model of the temporal evolution that models all the responses to the questionnaires at the same time. The analysis would thus exhibit typical temporal evolution behaviours, which are the objects that researchers in humanities and social sciences wish to study.
This thesis will thus provide a complete tool for analysing questionnaires repeated over time. The core of the thesis will be the development of a statistical model and associated inference algorithms. But the PhD student will go as far as the implementation of a software tool in the form of an R package, so that researchers in humanities and social sciences can easily use these results.
Topic:
In many areas of humanities and social sciences, studies are based on questionnaires completed by participants. The data provided by participants' answers are of different types, generally referred to as mixed data in the literature:
– nominal categorical: when questions are of the type “what is your socio-professional category?”,
– ordinal categorical: when questions are of the type “what is your level of satisfaction: bad, average, good?”,
– quantitative: when questions are of the type “what is your age?”,
– textual: for open questions with free answers.
Moreover, we consider questionnaires repeated over time: participants fill in the questionnaires several times over the study period. A time component is thus added to the mixed data set.
The machine learning task we want to address for these data is unsupervised: there is no specific notion we want to predict; instead, we want to explore the data sets in order to exhibit typical behaviours. This task is known as clustering: we want to build clusters of data such that observations within a cluster are similar and clusters are different from each other. The data analysis will then no longer be based on the observation of individual responses to the questionnaires, but on the summaries provided by the clusters.
More specifically, once the whole data set of observations over time has been clustered, each cluster will gather the participants who share the same evolution of answers over time. This information is essential for data analysis from a humanities and social sciences point of view.
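As a toy illustration of the clustering task only (hypothetical data; not the thesis's model-based approach), the following sketch naively encodes a small mixed questionnaire and clusters it with a tiny k-means. The thesis precisely aims at avoiding such ad hoc encodings, which introduce bias or lose information:

```python
import random

# Hypothetical mixed answers: (satisfaction ordinal, category nominal, age).
satisfaction = {"bad": 0.0, "average": 0.5, "good": 1.0}   # ordinal -> score
category = {"student": [1, 0], "employee": [0, 1]}          # nominal -> one-hot

answers = [("bad", "student", 22), ("average", "student", 24),
           ("good", "employee", 51), ("good", "employee", 48)]

def encode(sat, cat, age):
    return [satisfaction[sat]] + category[cat] + [age / 100.0]  # crude scaling

points = [encode(*a) for a in answers]

def kmeans(points, k, iters=20):
    """A minimal Lloyd's k-means on plain Python lists."""
    random.seed(0)
    centers = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return [min(range(k), key=lambda j: sum((a - b) ** 2
               for a, b in zip(p, centers[j]))) for p in points]

labels = kmeans(points, k=2)
# The two young students cluster together, as do the two older employees.
assert labels[0] == labels[1] and labels[2] == labels[3] and labels[0] != labels[2]
```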
Candidate profile:
Master's or engineering degree in applied mathematics or computer science.
Apply at
https://www.universite-lyon.fr/version-anglaise/call-for-applications-doctoral-contracts-in-artificial-intelligence-217192.kjsp?RH=3484404878389026
Required training and skills:
Machine learning, R or Python, statistical learning
Job location:
Université Lyon 2, Campus Porte des Alpes, Bron
Attached document: 202104010502_IA PhD – JACQUES – PRIM-ALLAZ.pdf
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: Laboratoire des Signaux et Systèmes
Duration: 3 years
Contact: francois.orieux@l2s.centralesupelec.fr
Application deadline: 2021-04-22
Context:
More details can be found here https://pro.orieux.fr/files/data-fusion.pdf
The objective of this subject is to develop new methods, notably inspired by machine learning approaches, for the fusion of heterogeneous data, hyperspectral and multispectral images in particular. It takes place within the international James Webb Space Telescope (JWST) project, the most ambitious space telescope, scheduled for launch in October 2021, which will carry four instruments with unprecedented hyperspectral and multispectral observing capabilities.
Hyperspectral and multispectral imaging are ubiquitous in many observation modalities, such as Earth observation (e.g., the European Copernicus project), astronomy, medical imaging and material analysis. The work carried out during the PhD should therefore also be applicable to many other modalities.
This thesis is part of the scientific project of the joint laboratory LAB4S2 (Innovative Laboratory for Space Spectroscopy) between the IAS and the company ACRI-ST (https://www.acri-st.fr/fr). The objective of LAB4S2 is to develop fusion methods for the joint exploitation of hyperspectral and multispectral data obtained, on the one hand, by the mid-infrared instrument of the JWST (MIRI) and, on the other hand, by the instruments on board the Sentinel missions of the Copernicus project.
Topic:
The aim of the thesis is to develop efficient algorithms for joint processing:
– of multispectral data, obtained with an imager over a large field of view and with broadband filters;
– of hyperspectral data, with high spectral resolution but small fields of view.
This problem, which can be likened to a data fusion problem, has been addressed in Earth observations (with so-called pansharpening methods), in a context where the effects due to the observing instruments (limited spatial resolution and insufficient spatial sampling) are negligible.
The problem can be solved by minimization of a loss function with explicit data models for multispectral (imaging) and hyperspectral (spectroscopic) instruments. This approach solves the problem of measurement heterogeneity by relying on a “virtual” instrument that combines imaging and spectroscopy.
The above-mentioned method is model-driven. However, the model can be inefficient in some cases, for instance when the degradations due to the instruments, such as spatial blurring, are significant. A common approach is to add a regularizer, but this has some limitations: the prior models are often ad hoc, and the compromise between the different sources of information must be tuned. During this work we will explore alternative methods based on machine learning to construct better adapted priors. The challenge here will be the lack of the large database necessary for supervised approaches. We will therefore explore new methods based on small databases, such as few-shot approaches, which consist of adapting DNNs trained on existing databases (such as ImageNet) using another database that has only a few examples.
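As a minimal illustration of the model-driven formulation (a hypothetical 1-D toy with made-up operators, not the actual MIRI/Sentinel instrument models), the following sketch fuses a full-coverage, low-resolution measurement with a small-field, high-resolution one by minimizing a least-squares criterion with gradient descent:

```python
# Ground-truth "spectrum" to recover.
x_true = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]

def A1(x):   # "multispectral": broadband averages over blocks of 2, full coverage
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

def A2(x):   # "hyperspectral": exact samples, but only over a small field
    return x[:4]

y1, y2 = A1(x_true), A2(x_true)   # the two heterogeneous measurements

x = [0.0] * 8
lr = 0.2
for _ in range(500):
    r1 = [a - b for a, b in zip(A1(x), y1)]
    r2 = [a - b for a, b in zip(A2(x), y2)]
    grad = [0.0] * 8
    for i, r in enumerate(r1):        # adjoint of A1 spreads each residual
        grad[2 * i] += r
        grad[2 * i + 1] += r
    for i, r in enumerate(r2):        # adjoint of A2 is zero-padding
        grad[i] += r
    x = [xi - lr * g for xi, g in zip(x, grad)]

# Where both instruments observe, x is pinned down exactly; elsewhere
# only the block averages are constrained (hence the need for a prior).
assert all(abs(a - b) < 1e-3 for a, b in zip(x[:4], x_true[:4]))
assert abs((x[4] + x[5]) / 2 - 3.5) < 1e-3
```

The last assertion shows why regularization or learned priors are needed: in the region seen only by the "multispectral" instrument, infinitely many signals share the same block averages.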
Candidate profile:
A professional attitude is expected.
Required training and skills:
The candidate must have a M2 degree in signal processing, data science, or machine learning. Knowledge in physics or optics is appreciated.
Job location:
Laboratoire des Signaux et Systèmes (Université Paris-Saclay)
3 rue Joliot-Curie, 91192 Gif-sur-Yvette
Attached document: 202102221056_data-fusion.pdf
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: CNAM, CEDRIC
Duration: 3 years
Contact: stage.isid-mim@cnam.fr
Application deadline: 2021-04-25
Context:
Nowadays, organizations are moving forward with digital transformation. They are challenged by new Information and Communications Technologies (ICT) such as the Industrial Internet of Things, Artificial-Intelligence-based applications, Cyber-Physical Systems, blockchain, 5G networks, and so on [2]. Industry 4.0 (I4.0), commonly referred to as the fourth industrial revolution, denotes the current trend of automation and data exchange in manufacturing technologies [2]. Digitalization and Industry 4.0 are profoundly changing society, the economy and the way businesses operate. To face these challenges, organizations design and implement new business and Information Technology models based on distributed systems, to guarantee that the company will reach the objectives defined by new opportunities and to limit any threat from an unfavorable environment.
Numerous research works study the impact of these digital technologies on enterprise management, for instance within the Business-IT Alignment [3] or Enterprise Architecture [4] fields. However, Industry 4.0 ICT is still poorly aligned with business strategy and business goals, and the existing literature does not yet respond to this compelling need. For instance, the best-known framework dealing with it is the “Reference Architectural Model Industrie 4.0” (RAMI 4.0), proposed by the Standardization Council Industrie 4.0 (SCI4.0) [5] [6]. It defines a three-dimensional framework to structure and define Industry 4.0 components. Even if one of the axes of this model includes the business layer, that layer is reduced to organization and business processes. This lack of a powerful alignment metaphor is a weakness of the RAMI 4.0 proposal and of the other existing approaches. The lack of alignment induces a less efficient distribution of resources, a less adapted configuration, and a lower level of business-goal achievement and of adaptation to the context.
Topic:
The research problem we want to address in this PhD thesis proposal is related to these weaknesses. From our point of view, a completely new approach should encompass both business strategy and digital technologies deployed to support it through an intentional perspective. The notion of intention is essential for organizations as it allows requirements of internal and external users of ICT to be satisfied. The teleological (intention-oriented) perspective gains traction in many fields including different organizational aspects as it allows artifacts under consideration to be connected to business and other needs [7]. For example, Intent-Based Networking is an emerging approach allowing the configuration of the physical and virtual network infrastructure depending on business strategies requirements [8].
The intention-based conciliation approach would provide means for a context-aware adoption and configuration of the underlying digital technologies. Based on this assumption, we propose to apply approaches issued from the field of Situational Method Engineering (SME) [1] [9] [10]. SME aims at building methods adapted to concrete situations from a set of reusable method components, depending on context factors. Several approaches exist: assembly, extension, configuration and so on [9]. One of them is based on software product lines and develops a family of methods based on variability [1]. During application, the family is configured according to a given context and to the users’ intentions [10]. In the PhD project at hand, components should be defined at the level of ICT components and the family at the level of the infrastructure.
The research goal of this PhD project is to provide an intention-based approach to ease the integration of new ICT into organizations dynamically during all the Information System lifecycle. Organizations would be able to adapt and to stick, as quickly as possible, to business requirements changes. The work on this thesis includes:
– Preparation of a state-of-the-art review on intention-based approaches in ICT and their adoption by organizations,
– Formalization of the concept of a reusable ICT component allowing its contextual configuration,
– Elaboration of the ontology of intentions adapted to the usage of ICT components,
– Elaboration of a framework allowing ICT components to be related to business strategy through the intentional layer,
– Proposal of an approach for the contextual selection and configuration of ICT components including the selection of the appropriate technique to bring this ability to adapt the architecture of the applications and the underlying technologies.
References:
1. Kornyshova, E.; Deneckere, R. and Rolland, C. Method Families Concept: Application to Decision-Making Methods. In Enterprise, Business-Process and Information Systems Modeling, pages 413-427, Springer, London, United Kingdom, Lecture Notes in Business Information Processing 81, 2011.
2. Lu Y., Industry 4.0: A survey on technologies, applications and open research issues. Journal of Industrial Information Integration 6 (2017) 1–10.
3. Issa A., Hatiboglu B., Bildstein A., Bauernhansl T., Industrie 4.0 roadmap: Framework for digital transformation based on the concepts of capability maturity and alignment, Procedia CIRP, Volume 72, 2018, Pages 973-978.
4. Aldea A., Iacob M-E., Wombacher A., Hiralal M., Franck T. Enterprise Architecture 4.0 – A vision, an approach and software tool support. IEEE 22nd International Enterprise Distributed Object Computing Conference. 2018.
5. The Reference Architectural Model RAMI 4.0 and the Industrie 4.0 Component. 2015. https://www.zvei.org/en/subjects/industrie-4-0/the-reference-architectural-model-rami-40-and-the-industrie-40-component/, Accessed on January 2020.
6. Adolphs P., Bedenbender H., Dirzus D., Ehlich M., Epple U., Hankel M., Heidel R., Hoffmeister M., Huhle H., Kärcher B., Koziolek H., Pichler R., Pollmeier S., Schewe F., Walter A., Waser B., Wollschlaeger M. Reference Architecture Model Industrie 4.0 (RAMI4.0). 2015.
7. Deneckere, R. and Kornyshova, E. Processus téléologique et variabilité : Utilisation de la sensibilité au contexte. In Revue des Sciences et Technologies de l’Information – Série ISI : Ingénierie des Systèmes d’Information, 16:1: 61-88, 2011.
8. Han Y., Li J., Hoang D., Yoo J.-H., Won-Ki Hong J., An intent-based network virtualization platform for SDN, 12th International Conference on Network and Service Management (CNSM), 2016, 353-358.
9. Ralyté, J., Deneckère, R., Rolland, C., Towards a Generic Method for Situational Method Engineering, International Conference on Advanced Information Systems Engineering (CAISE), Klagenfurt, Austria, 95-110, 2003.
10. Deneckère R. and Kornyshova E. Process Line Configuration: an Indicator-based Guidance of the Intentional Model MAP, Evaluation of Modeling Methods in Systems Analysis and Design (EMMSAD), Hammamet, Tunisie, 2010.
Candidate profile:
The candidate must have very good skills in distributed systems, information systems engineering and process modeling, and be interested in intention-oriented approaches and new digital technologies (CPS, Industrial IoT, etc.). The candidate should have good writing skills in English. He/she must be highly motivated and independent, with a real ability to organize and follow a schedule.
Required training and skills:
The candidate must hold a M.Sc. in Computer Science (or equivalent).
Job location:
Conservatoire National des arts et Métiers, CEDRIC Lab, 2, rue Conté 75003 Paris France
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: CREATIS and LIRIS – 69 Villeurbanne
Duration: 36 months
Contact: carole.lartizien@creatis.insa-lyon.fr
Application deadline: 2021-04-30
Context:
Applying machine learning (ML) to healthcare is among the most challenging applications, with the potential to exploit the information provided by an exponentially growing mass of heterogeneous data (images, semantic information, biological parameters, …). These models require a large amount of data to perform well, particularly in the era of large-scale deep neural networks. One option to increase the training population is to promote multi-centre clinical studies, but this raises many privacy-related problems, since data producers lose control over their data, and generates huge data traffic. Federated learning (FL) is a new ML approach that was recently introduced to reconcile the need to access large databases with the responsibility to maintain the privacy of individual participants. In this context, FL appears as a very promising technique, first to protect patient privacy, thus complying with the increasingly stringent general data protection regulations (GDPR), and then to limit the huge data traffic required when gathering medical data on a centralized server. This research field is in its early stages and needs to address key challenges related to the specificity of medical data.
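As an illustration of the federated idea only (a hypothetical one-parameter toy, not the project's actual algorithm or data), the following sketch runs federated averaging (FedAvg-style) over three simulated centres; only the model parameter, never the raw data, leaves each centre:

```python
# Three "centres", each holding private (x, y) samples of the trend y ≈ 2x.
centres = [
    [(1.0, 2.1), (2.0, 3.9)],   # centre A
    [(1.0, 1.9), (3.0, 6.2)],   # centre B
    [(2.0, 4.0), (4.0, 7.9)],   # centre C
]

def local_update(w, data, lr=0.02, epochs=20):
    """Local SGD on the centre's private data; only w is returned."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x   # gradient of (w*x - y)^2
    return w

w_global = 0.0
for _ in range(10):                          # communication rounds
    local_ws = [local_update(w_global, d) for d in centres]
    w_global = sum(local_ws) / len(local_ws) # server averages the weights

assert abs(w_global - 2.0) < 0.1             # recovers the shared trend y ≈ 2x
```

The raw samples stay inside each centre's list; the server only ever sees the scalar weights, which is the privacy property the paragraph above describes.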
Topic:
The aim of this PhD is to investigate methodological research in this domain with application to the design of diagnosis and prognosis models of brain pathology based on multimodality imaging.
Please see the detailed description in the attached pdf file.
You can also have a look here :
https://www.creatis.insa-lyon.fr/site7/fr/node/47088
This PhD project is funded by the IADoc@UDL program promoting research in AI.
Candidate profile:
The candidate is expected to have strong knowledge of machine learning and some experience in image processing. Prior experience with medical image processing would be appreciated but is not required. Good programming skills (Python, etc.) are also required. We are looking for an enthusiastic and autonomous student with strong motivation and interest in multidisciplinary research (image processing and machine learning in a medical context).
Required training and skills:
The candidate is expected to have strong knowledge of machine learning and some experience in image processing. Good programming skills are required.
Job location:
The doctoral student will share his/her time between the two laboratories (both located on the Campus La Doua in Villeurbanne) according to the needs and the progress of the project.
Please contact us for more information on the project and the application procedure.
Deadline for application is April 23
Attached document: 202103180954_PhDproject_IADoc_FL_MedImag_CREATIS_LIRIS_2021.pdf
Offer linked to an Action/Network: – — –/Doctoral students
Laboratory/Company: Lab-STICC UMR CNRS, ENSTA Bretagne site
Duration: 10 months
Contact: jean-christophe.cexus@ensta-bretagne.fr
Application deadline: 2021-04-30
Context:
This project is part of the characterization of the environment and the fine description of an observed scene for applications in the detection, localization and tracking of potential targets (aircraft, ship, vehicle, …). Within this work, we are interested in innovative Artificial Intelligence (AI) methods, such as deep learning, for the analysis of data that may come both from “conventional” sensors (radar, optics, lidar, …) and from less conventional “information sources” (weather maps, geographic maps, operational knowledge). Within a decision-making function, all these different sources provide the raw material; our objective is the development of new processing architectures improving the performance and adaptability of the system on new tasks/actions.
Topic:
Consequently, the work will focus on the development of tools dedicated to evolving deep learning. The learning approaches to be developed must be able to adapt to new, unlearned situations in the context of the detection and recognition of targets from heterogeneous data (radar images, optical images, etc.).
Keywords
Machine learning, deep learning, detection and recognition, transfer learning, automatic target recognition, data science
Candidate profile:
This position is open to holders of a doctoral degree in one of the fields indicated in the topic.
Required training and skills:
Expected skills
This position is open to holders of a doctoral degree in one of the fields indicated in the objectives, in particular with the following skills:
– machine learning, data science
– signal and image processing
– applied mathematics
– proficiency in programming: Python, TensorFlow, PyTorch, …
Knowledge of radar and remote sensing would be a plus.
Job location:
The position is located at ENSTA Bretagne within the STIC department, which counts about a hundred people, including around forty permanent staff. Its teaching themes mainly fall within the specialities of observation systems (acoustic, electromagnetic, …), hydrography, robotics, artificial intelligence, software modelling and systems security (cyberdefence).
The vast majority of the department's faculty members belong to Lab-STICC (Laboratoire des Sciences et Techniques de l’Information, de la Communication et de la Connaissance, UMR CNRS 6285), of which ENSTA Bretagne is a supervising institution.
Send a CV and a cover letter to:
– abdelmalek.toumi@ensta-bretagne.fr
– ali.khenchaf@ensta-bretagne.fr
– jean-christophe.cexus@ensta-bretagne.fr
Attached document: 202103100959_Fiche_Poste_ML.pdf
Offer linked to an Action/Network: – — –/Doctoral students
Laboratory/Company: LITIS, Rouen Normandie
Duration: 36 months
Contact: paul.honeine@univ-rouen.fr
Application deadline: 2021-04-30
Context:
Recently, geostatistical approaches adapted to the particularities of polluted sites and soils have been developed and are increasingly used to map soils with respect to regulatory thresholds or to estimate quantities of contaminated materials, while quantifying the associated local or global uncertainties. Coupling the two tools, on-site measurements and geostatistical methods, makes it possible to envisage optimized survey campaigns, where the number and location of new measurement points are determined as the data are acquired, with the objective of reducing the uncertainty affecting the modelling of the contamination.
Topic:
The objective is to reduce the uncertainties in the cost of remediating the soils of brownfields and polluted sites, which stem from the geostatistical model being built on an often random, poorly representative sampling. To this end, several projects have been supported by ADEME over time. The geostatistical methods currently used are based on the computation of variograms for kriging (spatial interpolation with minimal variance). Kriging in geostatistics is nothing other than Gaussian processes in statistical learning.
This thesis aims to develop new methods from statistical learning theory based on deep neural network architectures (deep learning). Indeed, the targeted application domain, and geostatistics more generally, have not yet benefited from the emergence of statistical learning methods with deep networks. Two research directions are envisaged during this thesis:
– better modelling by coupling deep learning with Gaussian processes, and
– the development of optimal resampling methods.
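The equivalence between kriging and Gaussian-process regression can be illustrated with a minimal sketch (hypothetical 1-D measurements and an RBF covariance; not the thesis model):

```python
import math

xs = [0.0, 1.0, 2.0]          # measured locations
ys = [1.0, 3.0, 2.0]          # measured concentrations

def k(a, b, length=1.0):
    """RBF (squared-exponential) covariance between two locations."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

K = [[k(a, b) + (1e-9 if a == b else 0.0) for b in xs] for a in xs]
alpha = solve(K, ys)          # kriging weights: K alpha = y

def predict(x):               # kriging / GP posterior mean: k(x, xs) · alpha
    return sum(k(x, xi) * ai for xi, ai in zip(xs, alpha))

# The noise-free predictor interpolates the measurements exactly ...
assert all(abs(predict(xi) - yi) < 1e-4 for xi, yi in zip(xs, ys))
# ... and gives a smooth estimate between them (its variance, not shown,
# is what adaptive sampling campaigns would seek to reduce).
assert 2.0 < predict(1.5) < 3.0
```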
Candidate profile:
– Student in M2 or in the final (5th) year of an engineering school, with a particular motivation for data science and/or advanced statistics.
– Solid programming skills in Python and/or C++.
– Good interpersonal skills, to work in close collaboration with the startup Tellux.
Required training and skills:
Student in M2 or in the final (5th) year of an engineering school, with a particular motivation for data science and/or advanced statistics.
Job location:
Location: Rouen
Status: ADEME employee
Attached document: 202103092251_sujet thèse ADEME.pdf
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: Observatoire de Lyon
Duration: 36 months
Contact: ferreol.soulez@univ-lyon1.fr
Application deadline: 2021-05-01
Context:
Natural sciences such as astrophysics, geophysics and nuclear physics often use numerical simulations to model highly complex physical systems. These simulations are increasingly accurate thanks to the computational power now available. For example, 3D convection models can simulate the thermochemical evolution and structure of stars and planets.
However, to disentangle different models and to estimate physical parameters (e.g., initial conditions), the outputs of these simulations have to be compared to observations. This confrontation of simulations with observations is a major challenge in the natural sciences. Indeed, numerical simulations can now model quite accurately objects that are impossible to observe directly (e.g., the interiors of stars and planets, the environments of stars and black holes). As for the observations, although their quality and quantity are rapidly increasing, they are often only indirectly related to the actual parameters of interest (e.g., seismic wave observations are used to construct images of the Earth's mantle, and measured interferometric visibilities are used to characterize planet-forming disks).
Inferring simulation parameters from observations is very challenging. When a single simulation is computationally intensive, it is impossible to use stochastic or continuous optimization methods to infer parameters. In most cases, one can only rely on finding the best fits on a low-dimensional pre-computed grid of model parameters.
Subject:
The ultimate goal of the proposed thesis is to build a fast interpolation method on a grid of computational-physics simulated images (in a broad sense, as these can also be 3D volumes or spectra). With such a method, we can quickly obtain an approximation of a simulated image for any possible set of parameters, without having to run the expensive simulation. It will then be possible to use any method (optimization, Bayesian inference) to derive the sought-after distribution of parameters.
The main idea is to use a deep learning framework to build the interpolator. Indeed, all possible modeled images are concentrated on a lower-dimensional subspace or manifold. Deep neural networks such as Generative Adversarial Networks (GANs) [1] appear to be very efficient at modeling manifolds and could be much more efficient interpolators than classical polynomial interpolators. Trained on a grid of simulated images, these deep neural networks will produce continuous approximations of the images. As a toy example, on a properly defined manifold, the images of a single circle vary continuously with the circle's radius. Interpolation between two images of circles with different radii must follow this manifold, whereas any polynomial interpolation will produce an image with two circles rather than an image of a single circle with an intermediate radius.
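The circle toy example can be made concrete with a few lines of NumPy (an illustrative sketch; `circle_image` is a hypothetical toy renderer, not part of the project):

```python
import numpy as np

def circle_image(radius, size=64):
    """Binary image of a single circle of the given radius (toy renderer)."""
    yy, xx = np.mgrid[:size, :size]
    r = np.hypot(xx - size / 2, yy - size / 2)
    return (np.abs(r - radius) < 1.5).astype(float)

small, large = circle_image(10), circle_image(20)

# Naive pixel-space (linear/polynomial) interpolation: it superimposes
# two half-intensity circles instead of producing one circle.
pixel_interp = 0.5 * (small + large)

# Interpolation along the parameter manifold: a single circle of the
# intermediate radius, which a well-trained generative interpolator
# should approximate.
manifold_interp = circle_image(15)
```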
Grids of models are ubiquitous in physics, so such a project can have an important impact. To ensure that it is both robust and useful in practice, the deep-learning-based interpolator will be developed for two different applications: (i) planet-forming disk characterization using the VLTI, in collaboration with J. Kluska (KU Leuven), and (ii) reconstruction of mantle structure from geophysical surface observations.
Candidate profile:
Candidates must hold a national master's degree or equivalent in Applied Mathematics, Signal/Image Processing or Machine Learning.
They should have good coding proficiency (Julia/Python) and a real interest in the underlying astrophysical or geophysical questions.
Required education and skills:
Candidates must hold a national master's degree or equivalent in Applied Mathematics, Signal/Image Processing or Machine Learning.
Job address:
Observatoire de Lyon
9 avenue Charles André
Saint Genis Laval
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: LITIS (EA 4108), Université de Rouen
Duration: 36 months
Contact: stephane.nicolas@univ-rouen.fr
Publication deadline: 2021-05-07
Context:
This research topic is part of a collaboration between computer scientists, historians and archivists initiated in 2009 by the DocExplore project (2009-2013, http://www.docexplore.eu), a project of the Interreg IVa France (Channel) – England Cross-Border Cooperation Programme. This collaboration with historians and archivists was continued at the regional level through the PlaIR 2.0 project, supported by the GRR TL-TI from 2013 to 2016, and then extended to other partners through the PlaIR2018 project, supported by the FEDER and the Normandy Region from 2017 to 2020. The collaboration aims to build a software platform for the study and promotion of historical documents, in particular medieval ones, in order to ease the work of the historians who study these documents and of the curators who seek to promote them. This platform must offer them advanced features for image and handwriting analysis and for information retrieval through automatic indexing (http://spotting.univ-rouen.fr).
From a fundamental standpoint, the work proposed in this thesis falls within the "joint representation/decision learning" theme of the Machine Learning (Apprentissage) team at LITIS, and more specifically concerns one of the team's key topics: automatic representation learning for detection tasks.
Subject:
The objective of this thesis is to develop robust techniques for pattern spotting and pattern discovery in document images, building on recent advances in deep learning. Pattern spotting consists in searching for and precisely locating, in a document image, the occurrences of a graphical "object", i.e., a more or less complex shape such as a logo, a signature, an illuminated initial, a symbol, a cross, a coat of arms, etc., the query being formulated by pointing to an example of the target object in an image (image query). The interest of pattern spotting is to ease information retrieval in relatively complex collections of digitized historical documents, such as medieval documents. Pattern discovery, for its part, automatically identifies categories of graphical motifs, or objects more generally, in large document-image collections in an unsupervised way, i.e., with no prior knowledge of the object classes, or even of the number of possible classes. The goal is to discover, in document images, graphical structures that repeat or that are similar when analyzed at a certain level of abstraction. These two modes of use of such an indexing system, retrieval and discovery, could be very useful to historians, in order to efficiently find specific motifs in large collections of heterogeneous document images, or to discover relationships between similar motifs appearing in different manuscripts with more or less pronounced variations in representation style.
We will build on the work carried out in Sovann En's thesis (defended in 2016), in which we proposed a complete system for image retrieval and for localizing small graphical objects in medieval document images [En et al., 2016]. This system is based on a first extraction/indexing of regions of interest in the image (region proposals / Binarized Normed Gradients), a characterization of these regions with ad hoc descriptors (Vector of Locally Aggregated Descriptors and Fisher Vector), and a query-similarity search incorporating compression and approximation techniques (Inverted File, Product Quantization and Asymmetric Distance Computation). While this system showed good performance on the document-image corpus studied [En et al., 2017], it suffers from a number of weaknesses that make it poorly adaptable to other types of document images (color information is currently not exploited, for instance) and very sensitive to variations in size, shape, color and, more generally, style of the motifs to be detected. Moreover, the system scales poorly and requires post-processing for fine localization of objects within the regions of interest, for example using classical matching methods. Finally, the supported query mode assumes that the user can present the system with a graphical example that visually resembles the object being searched for. This is a very strong requirement that is difficult to meet in practice. It would be more convenient for the user to provide the retrieval system with a semantic description of the objects sought, or a rougher graphical description (for example a diagram or a freehand sketch).
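The Product Quantization / Asymmetric Distance Computation scheme used in the existing system can be sketched as follows (a toy NumPy implementation with naive k-means codebooks; production systems use optimized libraries such as Faiss):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_pq(X, n_sub=4, n_centroids=16, iters=15):
    """One small k-means codebook per subspace (toy Product Quantization)."""
    d_sub = X.shape[1] // n_sub
    codebooks = []
    for s in range(n_sub):
        sub = X[:, s * d_sub:(s + 1) * d_sub]
        C = sub[rng.choice(len(sub), n_centroids, replace=False)].copy()
        for _ in range(iters):
            assign = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
            for k in range(n_centroids):
                if (assign == k).any():
                    C[k] = sub[assign == k].mean(axis=0)
        codebooks.append(C)
    return codebooks

def encode(X, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    d_sub = X.shape[1] // len(codebooks)
    return np.stack([np.argmin(((X[:, s * d_sub:(s + 1) * d_sub][:, None]
                                 - C[None]) ** 2).sum(-1), axis=1)
                     for s, C in enumerate(codebooks)], axis=1)

def adc_search(query, codes, codebooks):
    """Asymmetric Distance Computation: exact query vs. quantized database."""
    d_sub = len(query) // len(codebooks)
    tables = [((C - query[s * d_sub:(s + 1) * d_sub]) ** 2).sum(-1)
              for s, C in enumerate(codebooks)]
    return sum(tables[s][codes[:, s]] for s in range(len(codebooks)))

# Toy database of 200 feature vectors in dimension 16.
X = rng.normal(size=(200, 16))
codebooks = train_pq(X)
codes = encode(X, codebooks)   # each vector stored as 4 centroid indices
dists = adc_search(X[0], codes, codebooks)
```

Ranking the database is then just `np.argsort(dists)`; the asymmetry (exact query against quantized database) is what keeps the per-vector cost down to a few table lookups.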
The indexing and retrieval system must therefore be more tolerant to variations in the representation (or graphical style) of a given object, and must make it possible to link a high-level semantic description to multiple graphical representations of the same object, which presupposes a supervised setting to learn models of previously identified objects. However, it is impossible to know in advance which type of object the user's search will target. One way around this problem is to learn, in an unsupervised way at indexing time, which similar structures are present in the data (the indexed corpus) at different levels of representation.
The goal of the thesis is therefore to explore the representation learning (deep learning) techniques recently proposed in the object detection community to overcome these difficulties. As a first step, the spotting system's capabilities will be extended to make it less sensitive to variations in representation (in terms of size, shape or color). For this, we can rely on techniques such as Faster R-CNN [Ren et al., 2017], which will have to be studied and adapted to advantageously replace the BING-based extraction of regions of interest (region proposals). Likewise, VLAD and Fisher Vectors poorly characterize small regions and color textures; deep features, such as those proposed in [Zhou et al., 2016] or [Babenko et al., 2015], could inspire a better characterization of the regions. Finally, Deep Supervised Hashing techniques, such as those recently proposed in [Liu et al., 2016] or [Jiang and Li, 2017], should make it possible to address scalability with a more efficient similarity search. This first part of the thesis will also follow up on several works carried out within international collaborations between the LITIS Machine Learning team and other research groups [Wiggers et al., 2018], [Wiggers et al., 2019], [Ubeda et al., 2019] and [Ubeda et al., 2020].
As a second step, the application of these deep models to pattern discovery will be studied, in an unsupervised setting, on large corpora of document images, in order to allow a finer indexing of these corpora at different levels of representation, thus enabling uses of the indexed content that better match user expectations (high-level semantic search, graphical similarity search, semantic similarity search). One may draw inspiration, for example, from recent techniques such as those proposed in [Doersch et al., 2015], [Seguin et al., 2016] or [Shen et al., 2019] to learn representations suited to the unsupervised setting.
For this work, the Machine Learning team will provide numerous document-image collections, acquired and annotated within the DocExplore project [En et al., 2016] and covered by an agreement signed between the Université de Rouen and the Bibliothèque Municipale de Rouen. These data, annotated at the pattern level, will allow a genuine, full-scale experimental evaluation of the research, which, given the importance of the subject and the originality of the proposed approaches, can lead to publications in high-level international journals and to the integration of new features into the PlaIR platform and the DocExplore software suite.
References:
[En et al., 2017] En, S., Nicolas, S., Petitjean, C., Jurie, F., Heutte, L. New public dataset for spotting patterns in medieval document images. Journal of Electronic Imaging, vol. 26, no. 1, 2017.
[En et al., 2016] En, S., Petitjean, C., Nicolas, S., Heutte, L. A scalable pattern spotting system for historical documents. Pattern Recognition, vol. 54, pp. 149-161, 2016.
[Ren et al., 2017] S. Ren, K. He, R. Girshick, J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137-1149, 2017.
[Zhou et al., 2016] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba. Learning Deep Features for Discriminative Localization. CVPR2016, pp. 2921-2929, 2016.
[Babenko et al., 2015] A. Babenko, V. Lempitsky. Aggregating Local Deep Features for Image Retrieval. ICCV 2015, pp. 1269-1277, 2015.
[Liu et al., 2016] H. Liu, R. Wang, S. Shan, X. Chen. Deep Supervised Hashing for Fast Image Retrieval. CVPR 2016, pp. 2064-2072, 2016.
[Jiang and Li, 2017] Q.Y Jiang, W.J. Li. Asymmetric Deep Supervised Hashing. arXiv preprint arXiv:1707.08325, 2017.
[Ubeda et al., 2020] I. Ubeda, J. Saavedra, S. Nicolas, C. Petitjean, L. Heutte. Improving pattern spotting in historical documents using feature pyramid networks. Pattern Recognition Letters, vol. 131, pp. 398-404, 2020.
[Ubeda et al., 2019] I. Ubeda, J. Saavedra, S. Nicolas, C. Petitjean, L. Heutte. Pattern spotting in historical documents using convolutional models. 5th International Workshop on Historical Document Imaging and Processing, HIP@ICDAR 2019, Sydney, NSW, Australia, pp. 60-65, 2019.
[Wiggers et al., 2018] K. Wiggers, A. Britto, L. Heutte, A. Koerich, L. Oliveira. Document image retrieval using deep features. 2018 International Joint Conference on Neural Networks, IJCNN2018, Rio de Janeiro, Brazil, pp. 1-8, 2018.
[Wiggers et al., 2019] K. Wiggers, A. Britto, L. Heutte, A. Koerich, L. Oliveira. Image retrieval and pattern spotting using siamese neural network. International Joint Conference on Neural Networks 2019, IJCNN2019, Budapest, Hungary, pp. 1-8, 2019.
[Doersch et al., 2015] C. Doersch, A. Gupta, A. Efros. Unsupervised visual representation learning by context prediction. ICCV 2015, pp. 1422–1430, 2015.
[Seguin et al., 2016] B. Seguin, C. Striolo, I. di Lenardo, F. Kaplan. Visual link retrieval in a database of paintings. ECCV 2016, pp. 753–767, 2016.
[Shen et al., 2019] X. Shen, A. Efros, M. Aubry. Discovering Visual Patterns in Art Collections With Spatially-Consistent Feature Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9278-9287, 2019.
Candidate profile:
Master 2 in computer science or applied mathematics, or an engineering degree.
Candidates must send their CV and transcripts (L3, M1 and current year, with rankings), together with a cover letter, to Laurent HEUTTE and Stéphane NICOLAS (contact details below), no later than May 7, 2021.
Required education and skills:
The candidate must hold a master's degree (or equivalent) in Computer Science, with a focus on Signal and Image Processing or Data Science. They must have solid knowledge of machine learning and classification, in particular deep learning, as well as skills in image retrieval.
Job address:
Host team:
Machine Learning team, LITIS laboratory (EA 4108), Université de Rouen
http://www.litislab.fr/equipe/docapp/
Supervision:
Laurent HEUTTE (supervisor), laurent.heutte@univ-rouen.fr, (+33) 2 32 95 50 14
Stéphane NICOLAS (co-supervisor), stephane.nicolas@univ-rouen.fr, (+33) 2 32 95 52 14
Attached document: 202103311414_sujet_alloc_URN_spotting_2021.pdf
Offer linked to an Action/Network: BigData4Astro/Doctorants
Laboratory/Company: Institut de Planétologie et d’Astrophysique de Gr
Duration: 36 months
Contact: mickael.bonnefoy@univ-grenoble-alpes.fr
Publication deadline: 2021-05-07
Context:
More than 4500 exoplanets have been discovered to date, most of which formed billions of years ago. The recent direct-imaging detection of planets still in the process of formation [1] opens an unprecedented observing window on the initial stages of planetary systems (ages of tens of millions of years).
This discovery was made possible by advances in efficient adaptive-optics systems coupled to medium-resolution integral field spectrographs (IFS), producing hyperspectral data at high spatial and spectral resolution. The diversity of these data can be exploited to remove the bright stellar halo and isolate the faint and sparse planetary signal. There is great potential to improve the overall data-processing strategy and boost our detection capabilities.
Subject:
The student will lead the development, implementation (testing and validation), and deployment of novel algorithms for processing IFS data and improving the detection of forming planets. The work will be split into three main items:
– the student will adapt specific methods [2] to improve the quality of the pre-processing steps applied to our hyperspectral data and reduce the false-positive rate;
– he/she will propose and adapt algorithms for detecting sparse planetary signals in the optimally processed data (matched filtering, pattern recognition, etc.);
– he/she will then apply these tools at scale to existing and forthcoming data obtained at the Very Large Telescope (VLT, Chile) to detect new planets.
The student will be given the opportunity to lead his/her own observing programs, and to participate in the scientific preparation of the upcoming instruments VLT-SPHERE+ and ELT-HARMONI, which will soon offer additional opportunities for detecting and characterizing such planets. The algorithms and discoveries will be published in reference papers, and the codes will be publicly released.
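As an illustration of the matched-filtering idea for sparse point-like signals, here is a toy NumPy sketch (a synthetic point source with an assumed Gaussian PSF in white noise; real IFS data require far more careful halo subtraction and noise modeling):

```python
import numpy as np
from numpy.fft import fft2, ifft2

rng = np.random.default_rng(1)

def gaussian_psf(size, sigma=2.0):
    """Toy point-spread function; a real instrumental PSF would be measured."""
    c = size // 2
    yy, xx = np.mgrid[:size, :size]
    psf = np.exp(-((xx - c) ** 2 + (yy - c) ** 2) / (2 * sigma ** 2))
    return psf / psf.sum()

size = 128
psf0 = np.fft.ifftshift(gaussian_psf(size))  # PSF centered at pixel (0, 0)

# Faint companion at a known test position, buried in white noise.
truth = np.zeros((size, size))
truth[40, 90] = 50.0
image = ifft2(fft2(truth) * fft2(psf0)).real
image += rng.normal(0.0, 0.1, image.shape)

# Matched filter: cross-correlate the image with the PSF template.
score = ifft2(fft2(image) * np.conj(fft2(psf0))).real
peak = np.unravel_index(np.argmax(score), score.shape)
```

Cross-correlating with the expected signal shape is the optimal linear detector for a point source in white Gaussian noise; the peak of the score map recovers the injected position.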
[1] Haffert et al. 2019, Nature Astronomy, 3, 749
https://arxiv.org/pdf/1906.01486.pdf
[2] Berdeu et al. 2020, A&A, 635, 90
https://www.aanda.org/articles/aa/pdf/2020/03/aa36890-19.pdf
The student will be co-supervised by Mickaël BONNEFOY and Philippe DELORME at IPAG (Grenoble, France) and by Ferréol SOULEZ at CRAL (Lyon, France). The student will work within the framework of the ANR project FRAME, hosted at IPAG and coordinated by M. BONNEFOY. FRAME focuses on protoplanet detection and characterization. The student will thus be part of vibrant research teams including astronomers with expertise in star formation and exoplanets as well as data scientists, and will also collaborate closely with signal-processing experts in the local area (Grenoble, Saint-Etienne, Lyon).
Candidate profile:
We are looking for a master's student with a background in data science and a strong interest in astrophysics. The student should demonstrate an aptitude for solving complex problems rigorously and for working with data and algorithms. He/she should have excellent writing skills in English (French is a plus) and be able to present his/her work. Teamwork skills are essential.
Required education and skills:
Master's degree in Applied Mathematics, Signal Processing or Data Science.
Knowledge of spectroscopy and imaging; knowledge of hyperspectral imaging is appreciated.
Coding skills in Python; knowledge of Julia and Matlab is a plus.
Writing skills in English.
Job address:
Institut de Planétologie et d’Astrophysique de Grenoble
414 Rue de la Piscine
38400 SAINT-MARTIN D’HÈRES
FRANCE
Attached document: 202104150950_Offre – Detection of forming planets with integral field spectrographs.pdf
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: Laboratoire Hubert Curien (LabHC)
Duration: 36 months
Contact: christine.largeron@univ-st-etienne.fr
Publication deadline: 2021-05-08
Context:
In many applications, the data to be studied are relational, modeled in the form of a network and represented by graphs. This representation makes it possible to study interactions between people, especially in the social sciences. Although network analysis is an active branch of data mining and machine learning, much work in other fields also focuses on graphs, where the nodes correspond to the entities of the network and the links (edges or arcs) to their relationships, for instance in biology or image processing.
Among the main tasks studied on such data, one can mention supervised and unsupervised clustering, link prediction and recommendation.
Subject:
One of the main limitations of the existing methods for these tasks is that they usually assume the network to be fully known, so that it can be perfectly defined by a graph, or by a sequence of graphs for a dynamic network. In practice, however, the graph is often only partially observable, or the data are observed with some degree of uncertainty. This uncertainty concerns the presence or absence of a node (or link) at a given moment and, where applicable, the weight or orientation characterizing the link.
Although very widespread in practice, this uncertainty is not without effect on the outcomes of classical graph analysis, yet it is understudied in the literature. Concerning the classification of entities, one can cite the works of Sevon et al. or Pfeiffer et al.; for link prediction, those of Mallek et al., Ahmed et al. and Murata et al.; and for community detection, those of Dahlin et al., Liu et al., Martin et al. and Zhang et al., as well as our own work (Benyahia et al.). Within the framework of Graph Neural Networks, only a few works address learning attribute-missing graph embeddings (Chen et al.). The applications to many downstream domains in information and communication science and technology are nonetheless important, for example in databases (Banerjee et al.) and computer networks (Sasaki et al.). In this project, our objectives are therefore twofold. First, we will study the impact of the incompleteness of relational data on network algorithms. Second, we will develop methods able to account for node and link incompleteness in graph-processing algorithms as well as in graph-based machine learning techniques.
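A minimal sketch of the kind of question raised by partially observed graphs (pure Python, toy graph; the common-neighbours score is a classical link-prediction baseline, not the method to be developed):

```python
import itertools

# Toy "true" graph: two triangles joined by an edge, plus a node 6
# connected to both 1 and 2.
true_edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4),
              (3, 5), (4, 5), (1, 6), (2, 6)}

# Partially observed network: the edge (1, 2) was never recorded.
observed = true_edges - {(1, 2)}

adj = {}
for u, v in observed:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def common_neighbors_scores(adj):
    """Score every non-edge by its number of common neighbours."""
    nodes = sorted(adj)
    scores = {}
    for u, v in itertools.combinations(nodes, 2):
        if v not in adj[u]:
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores

scores = common_neighbors_scores(adj)
```

Here the unrecorded edge (1, 2) receives the top score, so a simple heuristic can flag plausible missing links; the thesis would go further by quantifying how such incompleteness biases downstream algorithms.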
Candidate profile:
The candidate should have a master's degree or equivalent in Computer Science. The subject lies at the intersection of several domains: graph theory, statistics, data mining and machine learning, and big data. The candidate should therefore have a strong background in several of these topics.
Other required skills:
• Good abilities in algorithm design and programming.
• Good technical skills regarding data mining, machine learning and data management
• A very good level (written and oral) in English.
• Good communication skills (oral and written).
• The ability to work in a team with colleagues from other scientific disciplines.
• Autonomy and motivation for research.
Required education and skills:
Applicants are invited to make contact as soon as possible.
The application file should contain the following documents:
1. a curriculum vitæ (CV);
2. the official academic transcripts of all the candidate’s higher education degrees (BSc, License, MSc, Master’s degree, Engineer degree, etc.). If the candidate is currently finishing a Master’s degree, s/he must send the transcript of the grades obtained so far, with the rank among her/his peers, and the list of classes taken during the last year;
3. some recommendation letters (quality matters more than quantity);
4. and a motivation letter written specifically for this position.
Send all of these documents by email to all the advisors:
• Christophe Gravier christophe.gravier@univ-st-etienne.fr
• Christine Largeron christine.largeron@univ-st-etienne.fr
Job address:
The PhD candidate will work at the Laboratoire Hubert Curien (UMR 5516) under the supervision of Christophe Gravier and Christine Largeron, Professors in Computer Science (LabHC – Université Jean Monnet, Saint-Etienne, France).
Attached document: 202103081707_PHD-positionLabHC.pdf
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: LIS, Aix-Marseille Univ.
Duration: 3 years (2021-2024)
Contact: caroline.chaux@univ-amu.fr
Publication deadline: 2021-05-14
Context:
The data deluge and recent trends in machine learning have led to an explosion in the size of models, some with hundreds of billions of parameters [Brown et al., 2020]. The consequences of this phenomenon raise major concerns: the difficulty of controlling such models in terms of design, training, interpretation and security; the need for large computational resources, for training but also for making predictions; and an environmental impact that is reaching unsustainable levels [Strubell et al., 2019].
Subject:
The objective of this PhD project is to propose frugal models that can handle large volumes of data efficiently while being structured to provide reduced time and space complexity. As opposed to distillation techniques [Hinton et al., 2015], which are applied after training, the target structures are intrinsic to the proposed models. Whether in deep neural networks or in other machine learning models, the space and time complexity is mainly due to the linear part of the models, involving large matrices or tensors of data and parameters. A key challenge is to reduce this particular cost. Apart from the well-known low-rank approaches [Mishra et al., 2013], one promising strategy is to decompose a large N × N matrix into a product of sparse matrices. Such models, named Flexible Multilayer Sparse Approximations [Le Magoarou and Gribonval, 2016] or butterfly factorizations [Dao et al., 2019, Vahid et al., 2020], inherit the structure of fast transforms like the fast Fourier or Hadamard transforms. Consequently, their typical complexity is O(N logN) in space and in time for matrix-vector multiplications. Preliminary works have shown how to leverage such models to revisit K-means in a frugal way [Giffon et al., 2021] and to compress neural networks [Giffon, 2020]. However, such models are not yet well controlled, both in their ability to model arbitrary data and in their training procedure [Cheukam Ngouonou, 2020, Le Quoc, 2020, Zheng, 2020].
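The fast-transform structure that butterfly factorizations inherit can be checked directly: the Walsh-Hadamard matrix H of size 2^n factors into n sparse matrices with exactly two nonzeros per row, so a matrix-vector product costs O(N log N) instead of O(N²). A NumPy sketch (illustrative; dense factors are built here only for verification):

```python
import numpy as np

def hadamard(n):
    """Dense Walsh-Hadamard matrix of size 2**n (reference, O(N^2) apply)."""
    H = np.array([[1.0]])
    for _ in range(n):
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
    return H

def butterfly_factors(n):
    """H_{2^n} as a product of n sparse factors, 2 nonzeros per row each.

    Uses the Kronecker mixed-product identity:
    prod_k (I_{2^k} (x) H_2 (x) I_{2^{n-1-k}}) = H_2 (x) ... (x) H_2.
    """
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    return [np.kron(np.kron(np.eye(2 ** k), H2), np.eye(2 ** (n - 1 - k)))
            for k in range(n)]

n = 4
N = 2 ** n
factors = butterfly_factors(n)
product = np.linalg.multi_dot(factors)
```

Each of the n = log2(N) factors applies in O(N), which is exactly where the O(N log N) budget of the frugal models comes from.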
The expected works should contribute to the following research directions:
• proposing new frugal models, e.g., based on sparse/butterfly factorizations;
• studying the properties of such models (expressivity, frugality);
• developing learning algorithms for those models, with techniques including convex, nonconvex and combinatorial optimization;
• deploying such structures in machine learning models (neural networks, kernel machines, and so on), tasks and use cases.
Candidate profile:
Applicants should have excellent general skills in mathematics and computer science, ideally with some expertise in machine learning, optimization and signal processing. Some science-popularization tasks will also be attached to this position.
Required education and skills:
Master's degree in mathematics and/or computer science
Job address:
ECM/LIS, Bureau 511, Bât. Equerre, 38 rue F. Joliot Curie, F-13013 Marseille
France
Attached document: 202104221240_2021_PhD_offer_frugal_ML_models.pdf
Offer linked to an Action/Network: – — –/– — –
Laboratory/Company: Sorbonne Center for Artificial Intelligence – Sorb
Duration: 36 months
Contact: patrick.gallinari@sorbonne-universite.fr
Publication deadline: 2021-05-15
Context:
Numerical simulation of fluids plays an essential role in modeling complex physical phenomena in domains ranging from climate to aerodynamics. Fluid flows are well described by the Navier-Stokes equations, but solving these equations at all scales remains extremely complex in many situations, and in practice only an averaged solution supplemented by a turbulence model is simulated. Unfortunately, turbulence models present important weaknesses (Xiao and Cinnella, 2019). The increased availability of large amounts of high-fidelity data and the recent development and deployment of powerful machine learning methods have motivated a surge of recent work on using machine learning in computational fluid dynamics (CFD), and specifically in turbulence modeling (Duraisamy et al., 2019). Combining powerful statistical techniques and model-based methods opens an entirely new perspective for CFD. From the machine learning (ML) side, modeling complex dynamical systems and combining model-based and data-based approaches are the topics of active new research directions. This is the context of this project: our aim is to develop the interplay between Deep Learning (DL) and CFD in order to improve turbulence modeling and to challenge state-of-the-art ML techniques.
Subject:
Objective: Combining CFD models and Deep Learning
Our objective is to improve traditional CFD models, both in terms of complexity and of prediction accuracy, by adding ML components. Recent progress, and the generalized use of automatic differentiation both in differentiable solvers and in DL algorithms, has paved the way for the integration of DL techniques with ODE/PDE solvers. In the ML community, a starting point for such investigations was the Neural ODE paper (Chen, 2018), which promoted the use of ODE solvers for ML problems. For this research, we advocate the use of DL modules to complement CFD solvers, in the spirit of (Le Guen, 2021), who introduced a principled approach that is however still limited to basic PDEs. In our new context, we will analyze how to model unclosed terms in the Reynolds-Averaged Navier-Stokes (RANS) equations. This approach can be seen as a generalization of classical closure models. To ease this theoretical analysis, the approach will first be developed for a scalar surrogate of the Navier-Stokes equations, namely the nonlinear Burgers' equation, which has been widely used in the literature as a simplified ansatz for Navier-Stokes turbulence. The framework will then be deployed and adapted to the specificities of unsteady RANS simulations. Turbulence model augmentation will be achieved by supplementing classical closure models, for which we have prior knowledge, with data-driven corrections. The whole system, DL modules and numerical solvers, will be trained end to end using high-fidelity data.
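Where a learned correction term would plug into a differentiable solver can be sketched on the Burgers surrogate mentioned in this offer (a toy explicit finite-difference scheme; the `closure` hook stands in for a neural network trained end to end, and is a zero placeholder here):

```python
import numpy as np

def burgers_step(u, dt, dx, nu, closure=None):
    """One explicit step of viscous Burgers' equation with periodic BCs:
    u_t + u u_x = nu u_xx + closure(u)."""
    u_x = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
    u_xx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx ** 2
    rhs = -u * u_x + nu * u_xx
    if closure is not None:
        rhs = rhs + closure(u)  # the learned term would be a neural network
    return u + dt * rhs

# Coarse periodic grid, smooth initial condition.
N, nu = 128, 0.05
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
dx, dt = x[1] - x[0], 1e-3
u = np.sin(x)
for _ in range(500):
    u = burgers_step(u, dt, dx, nu, closure=lambda v: 0.0 * v)
```

Because every operation in the step is differentiable, gradients of a data-fit loss can flow through the solver into the parameters of `closure`, which is the solver-in-the-loop training setup alluded to above.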
To be useful for CFD applications, a learned model must accurately simulate flows outside the training distribution: operating conditions and environments may vary according to different physical factors, requiring models to extrapolate to new conditions. DL models could in principle be extremely efficient at learning complex dynamics, but they struggle to generalize to out-of-distribution data. We will adopt a new perspective by considering the learning of dynamical models from multiple environments, and propose a new framework leveraging the commonalities and discrepancies among environments. We expect this new setting to be more robust to new distributions than classical empirical risk minimization or robust optimization schemes.
References
Xiao H., Cinnella P. 2019, Quantification of model uncertainty in RANS simulations: A review. Progress in Aerospace Sciences, 108 (2019), 1-31.
Duraisamy K., Iaccarino G., and Xiao H., Turbulence modeling in the age of data, Annual Review of Fluid Mechanics 51, 357 (2019).
Chen, R.T.Q. et al., 2018. Neural Ordinary Differential Equations. NIPS (2018), 31–60.
Le Guen V., Yin Y., Dona J., Ayed I., de Bézenac E., Thome N., Gallinari, P. 2021. Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting. ICLR (2021), 1–22.
Schmelzer M., Dwight R.P., Cinnella P. 2020, Discovery of Algebraic Reynolds-Stress Models Using Sparse Symbolic Regression. Flow, Turbulence and Combustion 104 (2020), 579–603.
Gautier N. et al., 2015 Closed-loop separation control using machine learning. Journal of Fluid Mechanics, 770 (2015), 442–457
Arjovsky M. et al. 2020. Invariant Risk Minimization. arxiv.org/abs/2002.04692 (2020), 1–31.
Bhattacharya K. et al. 2020. Model reduction and neural networks for parametric PDEs. Preprint arXiv:2005.03180 (2020).
Kochkov, D. et al. 2021. Machine learning accelerated computational fluid dynamics. (2021), 1–13.
Um K. et al. 2020. Solver-in-the-Loop: Learning from Differentiable Physics to Interact with Iterative PDE-Solvers. Neurips (2020), 1–12.
Sirignano J., MacArt J.F., Freund J.B. 2020. DPM: A deep learning PDE augmentation method with application to large-eddy simulation. Journal of Computational Physics. 423, (2020), 109811.
Wang R. et al. 2020. Towards physics-informed deep learning for turbulent flow prediction. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020), 1457–1466.
Willard J.D. et al. 2020. Integrating physics-based modeling with machine learning: A survey. arXiv (2020), 1–34.
Candidate profile:
Master of Science in engineering, computer science or applied mathematics.
Please send the following information to P. Cinnella and P. Gallinari: CV, motivation letter, Master's grades, and recommendation letters when possible.
Required training and skills:
Good background in mathematics applied to fluid mechanics, in computer science and in machine learning.
Job location:
Sorbonne Université, 75005 Paris
SCAI – Sorbonne Center for AI
The applicant will work jointly at the Institut Jean le Rond d’Alembert (fluid mechanics) and at LIP6 (computer science).
Offer linked to Action/Network: – — –/Doctorants
Laboratory/Company: IBISC – Informatique, BioInformatique, Systèmes Co
Duration: 3 years
Contact: Khalifa.Djemal@ibisc.univ-evry.fr
Publication deadline: 2021-05-16
Context:
In recent years, various scientific studies have shown that air quality has an impact on health and is becoming an increasingly pressing issue at the urban scale. The identification and geolocation of air pollution sources are therefore an important challenge, and rely on the use of a large number of fixed and/or on-board multimodal gas sensors.
Subject:
In scientific research, the identification of pollution emission sources is based on the resolution of a complex inverse model that is ill-posed with respect to the observed data. Pollutant dispersion is generally monitored by sensors placed in a spatially discrete domain, which provide temporal observations. These observations are then used to estimate the properties of the contaminant sources, such as their positions and atmospheric release rates, and the model parameters governing the dispersion of these contaminants (e.g. velocity, dispersivity, site topography, meteorology, etc.). These estimates are essential for a reliable assessment of contamination hazards and risks. In the particular case of several contamination sources (with different positions and release rates), the observations represent a mixture or combination of two or more pollutants.
In this framework, the expected work will consist in solving a multiple-source localization problem in urban/industrial environments with a network of fixed and/or mobile sensors. Using optimized data from existing measurement campaigns, i.e. sources identified and located in a known environment, the project will first consist in implementing a deep learning model that actively takes into account the various sensor parameters. In a second step, the model thus built, with an active learning strategy, will be able to identify the emission sources in an unknown environment and estimate their positions and intensities.
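A minimal sketch of the pool-based active learning loop mentioned above, under strong simplifying assumptions: the synthetic features, the logistic-regression surrogate and the binary "source region" labels are stand-ins for the real sensor data and deep model; only the uncertainty-sampling query strategy itself is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real task: decide which of two candidate source
# regions explains a reading (features = simulated multimodal sensor responses).
X = rng.standard_normal((500, 5))
w_true = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y = (X @ w_true + 0.5 * rng.standard_normal(500) > 0).astype(int)

# Small initial labeled set containing both classes; the rest is the pool.
labeled = np.concatenate([np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]]).tolist()
pool = [i for i in range(500) if i not in labeled]

for _ in range(20):                       # active-learning rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the pool point closest to the decision boundary.
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)
    pool.remove(query)

acc = clf.score(X, y)
print(len(labeled), round(acc, 2))
```

In the thesis the "oracle" answering each query would be a measurement campaign or a dispersion simulation rather than a pre-labeled array, but the loop structure is the same.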
Candidate profile:
Master 2 research level or equivalent, in Artificial Intelligence (AI) and computer science, or in applied mathematics (modelling and scientific computing).
Required training and skills:
Proficiency with data processing and analysis methods and tools, and with the Python and C languages, is highly desirable. Basic knowledge of atmospheric environmental science will also be highly appreciated.
Job location:
IBISC – Université d’Evry Val d’Essonne
40 rue du Pelvoux
91000 Evry.
You can apply directly on the ADUM platform:
https://www.adum.fr
Offer linked to Action/Network: – — –/– — –
Laboratory/Company: Laboratoire Bordelais de Recherche en Informatique
Duration: 36 months
Contact: meghyn.bienvenu@labri.fr
Publication deadline: 2021-05-16
Context:
Accessing the relevant information contained in real-world data to support informed decision making is difficult, time-consuming, and error-prone due to the need to integrate data across multiple heterogeneous sources. Moreover, even if this first hurdle is overcome, a perhaps even more daunting challenge arises: how to obtain reliable insights from imperfect data? It is widely acknowledged that real-world data is plagued with quality issues, such as incompleteness (missing information) and errors (false or outdated information).
The ontology-based data access (OBDA) paradigm addresses the first challenge by facilitating access to (potentially heterogeneous) data sources through the use of ontologies that specify a convenient user-friendly vocabulary for query formulation (which abstracts from the way the data is stored) and capture domain knowledge that can be exploited at query time, via automated reasoning, to obtain more complete query results.
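A toy illustration of this paradigm, assuming a deliberately minimal setting (a subclass-only ontology and unary facts, far simpler than real OBDA systems): domain knowledge is applied by forward chaining, so that a query formulated in the ontology vocabulary returns more complete answers than the raw data alone.

```python
# Toy illustration (not a real OBDA system): subclass axioms let a query over
# the vocabulary "Researcher" retrieve facts stored under "Professor".
ontology = {            # subclass axioms: key is a subclass of value
    "Professor": "Researcher",
    "PhDStudent": "Researcher",
    "Researcher": "Person",
}
data = {("Professor", "ann"), ("PhDStudent", "bob")}

def saturate(facts, axioms):
    """Forward-chain the subclass axioms until a fixpoint is reached."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for cls, sup in axioms.items():
            for (c, ind) in list(facts):
                if c == cls and (sup, ind) not in facts:
                    facts.add((sup, ind))
                    changed = True
    return facts

def query(concept, facts):
    return sorted(ind for (c, ind) in facts if c == concept)

complete = saturate(data, ontology)
print(query("Researcher", complete))   # both individuals, thanks to reasoning
```

Production systems typically avoid materializing the saturation and instead rewrite the query at query time, but the observable behavior is the same: more complete answers via automated reasoning.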
While OBDA systems are growing in maturity, they too often fail to address the data quality issue, aside from issuing warnings when inconsistencies are discovered. It is therefore essential to equip OBDA systems with appropriate mechanisms for handling imperfect data: how to obtain meaningful answers to queries posed over imperfect data, and how best to generate a high-quality version of the data?
This position is part of the INTENDED Chair on Artificial Intelligence, whose aim is to develop intelligent, knowledge-based methods for handling imperfect data. For more information, see: https://intended.labri.fr/
Subject:
The PhD position will focus on the development of a customized user-sensitive approach to data quality in OBDA, in which users can give direction on how errors are resolved, based upon their knowledge, preferences, and intended use of the data.
More precisely, the student will define one or more notions of a data quality policy, examine the formal properties of such policies, and develop and analyze algorithms for constructing and debugging such policies and using them to clean the data and/or produce reliable answers to queries.
Depending on the interests of the student, the thesis could also involve a more practical component (implementation and testing of the developed algorithms), but the thesis is primarily focused on foundational research.
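As one deliberately simple reading of what a data quality policy could look like (the fact format, source names and priorities below are invented for illustration), conflicts on the same attribute can be resolved using user-supplied source priorities:

```python
# Toy sketch of a user-defined data quality policy: when two sources give
# conflicting values for the same attribute, the policy keeps the value from
# the source the user trusts most (names and priorities are illustrative).
facts = [
    ("ann", "office", "B214", "hr_db"),
    ("ann", "office", "C105", "web_scrape"),
    ("bob", "office", "A001", "web_scrape"),
]
trust = {"hr_db": 2, "web_scrape": 1}    # user-supplied source priorities

def repair(facts, trust):
    """Resolve key conflicts by keeping the fact from the most trusted source."""
    best = {}
    for subj, attr, val, src in facts:
        key = (subj, attr)
        if key not in best or trust[src] > trust[best[key][3]]:
            best[key] = (subj, attr, val, src)
    return sorted(best.values())

print(repair(facts, trust))
```

The thesis would study such policies formally (existence and uniqueness of repairs, complexity of query answering under a policy) rather than as code, but this conveys the "user gives direction on how errors are resolved" idea.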
Candidate profile:
As ontologies are expressed using logic-based formalisms, candidates should be familiar and comfortable with first-order logic.
Prior knowledge in one or more of the following areas would be a plus: knowledge representation and reasoning (especially description logics), database theory, Semantic Web (ontologies), theoretical computer science (in particular, computational complexity).
Strong English language skills (reading, writing, & speaking) are expected, but knowledge of French is not required. The working language can be either French or English.
Required training and skills:
At the start of the PhD, the candidate must hold a Master’s degree in computer science (or possibly mathematics, if accompanied by relevant computer science experience).
Job location:
LaBRI – Université de Bordeaux
351 cours de la Libération
F-33405 Talence cedex
Attached document: 202105051156_phd1-intended-4.pdf
Offer linked to Action/Network: – — –/– — –
Laboratory/Company: IBISC
Duration: 3 years
Contact: fariza.tahi@univ-evry.fr
Publication deadline: 2021-05-17
Context:
Building on initial work carried out in the IBISC and DAVID laboratories, the objective of the doctoral project is to define, theoretically validate, and then experimentally validate an approach for predicting the 3D structure of RNA-RNA and RNA-protein complexes.
Most work in structural biology concerns protein molecules. In recent years, however, RNAs, another type of molecule having, like proteins, a 3D conformation, have aroused growing interest. They have various functions, presumably related to their shape and physicochemical properties, and also to their interactions with other molecules (proteins as well as RNAs). Awareness of the existence of these non-coding RNAs during the last decade has resulted in renewed interest in studying their structure. For example, they are now being considered as possible therapeutic targets, as are various classes of proteins. Moreover, determining the complexes they form when interacting with proteins or with other RNAs would help to better understand their role in diseases like cancer. Many such complexes, whose structures have been determined by experimental methods like crystallography or NMR, are available in the PDB database (https://www.rcsb.org). However, because of the high cost of these methods, computational methods are needed to speed up the discovery of complexes of RNAs and proteins, by proposing models that predict potential structures, which can then be validated experimentally.
The computational methods proposed in the literature, like GARN2 [9] or RNAcomposer [19], are mostly concerned with predicting the 3D structure of an RNA (often by minimizing its energy) without taking into account its environment, i.e. its interactions with other RNAs or proteins. However, RNAs are very flexible molecules, and their 3D structure can vary under the effect of these interactions. Therefore, the 3D structure of a given RNA involved in a complex is not always that of minimum energy, since the stability of the complex (i.e. the global energy) is also required. Some methods have therefore been proposed to take these interactions into account, but to our knowledge, all are limited to predicting the interaction between two RNAs, such as Rascal [23], or between an RNA and a protein, such as ICM [1].
Sujet :
In this thesis project we propose to develop methods based on game theory that take into account the interdependence between the 3D structure of RNA molecules and the interactions they have with each other and with proteins. We will consider that for each RNA, different possible 3D structures are predicted upstream by existing tools. We will also assume, as a first step, that potential interaction regions are known, possibly predicted. The problem will be modeled as a game on a graph G(V,E), where the vertices (the players) represent the 3D structures of RNAs or proteins, and the edges the possible interactions between the associated 3D structures. Each vertex, or agent, will represent a different RNA; for each vertex v of V we know a set of possible 3D structures S_v, and for each 3D structure a set of potential interaction areas that may be involved if the RNA represented by v interacts with other RNAs. Each player will have three sets of actions at its disposal: it will have to choose a configuration among a (large) set of configurations, a subset of adjacent edges to indicate with which other RNAs it decides to interact, and, for each selected edge, its interaction area among a set of potential interaction areas computed beforehand. To better guide the search, it will be possible to introduce a distance on the set of 3D structures S_v; similarly, we may know in advance that some edges of the graph G will never be used, because the corresponding interactions are too weak.
We will look for complexes that are Nash equilibria. To compute Nash equilibria (stable solutions predicting complexes) in such a game-theoretic approach, reinforcement learning and online learning techniques will be used. This approach has previously been used for the computation of 3D RNA structures [8,9], and several algorithms exist for this purpose [2,3]. The main challenge, compared to classical models, is the definition of the players' utility functions, which must on the one hand be very fast to compute, and on the other hand be sufficiently rich that the Nash equilibria found are of good quality with respect to a global objective function (energy, for example) that is too expensive, in computing time, to optimize directly. As a first approximation, the utility function of a player could be equal to the energy of the configuration it has chosen plus the energies of the interactions with the other RNAs with which it has decided to interact (or a score function related to docking). Another possibility could be to use a multi-criteria approach along with game theory [13]. Indeed, we have recently shown that additional criteria (based on experimental data like SHAPE, or on the satisfaction of user constraints) can improve structure predictions (see also BiORSEO [7]), considering the insertion of 3D motifs as an additional criterion. The utility functions must also take into account, for each molecule, possible spatial congestion from its neighboring molecules, for example by checking intersections of spheres approximating each 3D configuration. Another difficulty is that the players' actions have to be symmetrical (an interaction is considered only if both involved molecules choose it). This gives rise to generalized Nash equilibria [10], and the search for decentralized learning algorithms in such a context is an active research subject [24].
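To make the online-learning ingredient concrete, here is a minimal, generic sketch (unrelated to RNA energies): two players running the Hedge regret-minimization algorithm in self-play on rock-paper-scissors, whose time-averaged strategies approach the game's unique Nash equilibrium (uniform play). The step size and horizon are standard textbook choices, not project parameters.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player (zero-sum game).
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])

T = 20000
eta = np.sqrt(2 * np.log(3) / T)         # standard Hedge step size
w1 = np.array([1.0, 2.0, 3.0])           # asymmetric start to avoid a trivial run
w2 = np.ones(3)
avg1, avg2 = np.zeros(3), np.zeros(3)

for _ in range(T):
    p, q = w1 / w1.sum(), w2 / w2.sum()
    avg1 += p / T
    avg2 += q / T
    # Each player observes its vector of expected payoffs and reweights actions.
    w1 *= np.exp(eta * (A @ q))          # row player's payoff per action
    w2 *= np.exp(eta * (-A.T @ p))       # column player's payoff per action
    w1 /= w1.sum()                       # renormalize to avoid overflow
    w2 /= w2.sum()

# Time-averaged strategies approach the uniform Nash equilibrium (1/3, 1/3, 1/3).
print(np.round(avg1, 3), np.round(avg2, 3))
```

In the thesis setting the payoff vectors would come from (fast) utility functions over configurations and interaction choices rather than a fixed matrix, and the symmetry constraint on interactions leads to the generalized-equilibrium variants cited above.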
The first step of the doctoral project will therefore be to finalize the game model described above. The next step will be to identify and set up the right online learning approach, combined with local optimization methods, so that this distributed system converges towards equilibria close to realistic complexes. Finally, the approach retained, once experimentally validated, will be used to treat real cases of RNA-RNA and RNA-protein complexes available in the PDB database (https://www.rcsb.org).
References:
[1] Arnautova Y.A., Abagyan R., Totrov M., Protein-RNA docking using ICM, J. Chem. Theory Comput. 2018
[2] Auer P, Cesa-Bianchi N, Fischer P., Finite-time analysis of the multiarmed bandit problem, Machine learning, 2002; 47(23):235-256.
[3] Auer P, Cesa-Bianchi N, Freund Y, Schapire RE, The nonstochastic multiarmed bandit problem, SIAM Journal on Computing, 2002; 32(1):48–77.
[4] Barth D., Bougueroua S., Gaigeot M.-P., Quessette F., Spezia R., et al. Graph theory for automatic structural recognition in molecular dynamics simulations. The Journal of chemical physics 149 (18), 2018.
[5] de Beauchene IC, de Vries SJ, Zacharias M. Fragment-based modelling of single stranded RNA bound to RNA recognition motif containing proteins, Nucleic Acids Res. 2016;44(10):4565-4580.
[6] Becquey L, Angel E, Tahi F., RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures, Bioinformatics. 2020; btaa944.
[7] Becquey L., Angel E., Tahi F., BiORSEO: A bi-objective method to predict RNA secondary structures with pseudoknots using RNA 3D modules, Bioinformatics, 2020 btz962.
[8] Boudard M, Bernauer J, Barth D, Cohen J, Denise A., GARN: Sampling RNA 3D Structure Space with Game Theory and Knowledge-Based Scoring Strategies, PLoS One. 2015;10(8):e0136444. Published 2015 Aug 27.
[9] Boudard M, Barth D, Bernauer J, Denise A, Cohen J., GARN2: coarse-grained prediction of 3D structure of large RNA molecules by regret minimization, Bioinformatics (Oxford, England). 2017 Aug;33(16):2479-2486.
[10] Dutang C., Existence theorems for generalized Nash equilibrium problems: an analysis of assumptions, Journal of Nonlinear Analysis and Optimization, 2013, 4(2), pp. 115-126.
[11] Engelen S, Tahi F. Tfold: efficient in silico prediction of non-coding RNA secondary structures, Nucleic Acids Res. 2010;38(7):2453-2466.
[12] Fortunel N.O., Chadli L, Coutier J, Lemaître G, Auvré F, Domingues S, Bouissou-Cadio E, Vaigot P, Cavallero S, Deleuze JF, Roméo PH, and Martin MT, KLF4 inhibition promotes expansion of adult human epidermal precursors and embryonic stem-cell-derived keratinocytes, Nature Biomed Eng, 2019, Dec;3(12): 985-997.
[13] Kagrecha A. et al., Constrained regret minimization for multi-criterion multi-armed bandits, arXiv:2006.09649, June 2020.
[14] Lamiable A., Barth D., Denise A., Quessette F., Vial S., Westhof E.: Automated prediction of three-way junction topological families in RNA secondary structures, Comput. Biol. Chem. 37: 1-5 (2012)
[15] Lamiable A., Quessette F., Vial S., Barth D., Denise A.: An Algorithmic Game-Theory Approach for Coarse-Grain Prediction of RNA 3D Structure, IEEE ACM Trans. Comput. Biol. Bioinform. 10(1): 193-199 (2013)
[16] Legendre A, Angel E, Tahi F., Bi-objective integer programming for RNA secondary structure prediction with pseudoknots, BMC Bioinformatics. Jan 15;19(1) :13. 2018.
[17] Legendre, A., E. Angel, and F. Tahi., RCPred: RNA Complex Prediction as a constrained maximum weight clique problem, BMC Bioinformatics, 2019 Mar 29;20(Suppl 3):128.
[18] Legendre, A., Ibéné M., E. Angel, and F. Tahi, C-RCPred: A multi-objective algorithm for interactive prediction of RNA complexes integrating user knowledge and probing data, to be submitted to ISMB’2021
[19] Popenda, M., Szachniuk, M., Antczak, M., Purzycka, K.J., Lukasiak, P., Bartol, N., Blazewicz, J., Adamiak, R.W., Automated 3D structure composition for large RNAs, Nucleic Acids Research, 2012, 40(14):e112
[20] Tav C, Tempel S, Poligny L, Tahi F. miRNAFold: a web server for fast miRNA precursor prediction in genomes. Nucleic Acids Res. 2016;44(W1):W181-W184.
[21] Tempel S, Tahi F., A fast ab-initio method for predicting miRNA precursors in genomes, Nucleic Acids Res. 2012;40(11):e80.
[22] A Vulin, M Sedkaoui, S Moratille, N Sevenet, P Soularue, O Rigaud, L Guibbal, J Dulong, P Jeggo, JF Deleuze, J Lamartine and MT Martin. Severe PATCHED1 deficiency in cancer-prone Gorlin patient cells results in intrinsic radiosensitivity. Int J Radiat Oncol Biol Phys.2018,1;102(2):417-425.
[23] Yamasaki S, Hirokawa T, Asai K, Fukui K., Tertiary structure prediction of RNA-RNA complexes using a secondary structure and fragment-based method, J Chem Inf Model. 2014;54:672–682
[24] C. Yu, M. Van der Schaar and A. H. Sayed, Distributed Learning for Stochastic Generalized Nash Equilibrium Problems, IEEE Transactions on Signal Processing, vol. 65, no. 15, pp. 3893-3908, 1 Aug.1, 2017
Candidate profile:
Candidates with a Master 2 level or equivalent (final year of an engineering degree).
Required training and skills:
M2-level training in computer science (with some background in biology), or in bioinformatics / computational biology. The candidate must have a solid background in algorithmics and combinatorial optimization. Some experience in structural bioinformatics would be appreciated.
Job location:
Bâtiment IBGBI. 23 bv. de France. 91000 Evry, France
Offer linked to Action/Network: – — –/– — –
Laboratory/Company: Laboratoire IBISC, Université d’Evry / Université
Duration: 3 years
Contact: fariza.tahi@univ-evry.fr
Publication deadline: 2021-05-17
Context:
Development of computational methods for the study of non-coding RNAs involved in bladder cancer
In recent years, machine learning methods, particularly deep learning, have grown considerably and have shown their effectiveness in a large number of fields, including biology and medicine. More and more bioinformatics methods and tools based on deep learning are being proposed in the literature to answer various biological and biomedical questions. In this project, we want to propose new deep learning methods for the prediction and analysis of particular genomic sequences, the non-coding RNAs (ncRNAs), in a biomedical context: their involvement in bladder cancer.
Non-coding RNAs, which do not code for proteins and constitute the largest part of genomes, are increasingly identified as playing important roles in the deregulation processes leading to pathologies such as cancer (Anastasiadou et al., 2018). They are thus considered potential diagnostic markers and therapeutic targets. Their identification and the determination of their function are important issues, and with next-generation sequencing (NGS), which generates considerable volumes of omics data, their prediction and characterization by in silico methods is essential to orient experimental studies.
Recently, long ncRNAs (lncRNAs), longer than 200 nucleotides, have been identified as potential regulators. But unlike small ncRNAs, their characterization and classification by structure and function are far from established. Determining the structure of a lncRNA is a difficult problem, both for experimental (crystallography, NMR) and bioinformatic methods. Determining its function is even more difficult, especially since, unlike proteins, ncRNAs with similar functions often lack sequence homology (RNA sequences show compensatory mutations maintaining structural conservation). Attempts to classify lncRNAs have been proposed, based on different criteria: transcript length, location, association with protein-coding genes. A summary of these classifications has been proposed in (St. Laurent et al., 2015). In (Kopp and Mendell, 2018), the authors suggest studying lncRNAs according to their location, explaining that this is often linked to function. But a large majority of published work is dedicated to the study of a specific lncRNA. For instance, a recent study (Uroda et al., 2019) reveals the importance of the presence of a pseudoknot (a particular secondary-structure motif) in the regulation mechanism of the MEG3 lncRNA in the biological pathway of p53, a gene involved in many cancers.
From a computational point of view, a few methods have been proposed in the literature for the classification of well-characterized ncRNAs whose structure is well known. These methods, based on supervised machine learning, often of the deep learning type, build a model on a dataset composed of 13 classes of small ncRNAs. We can cite RNAcon (Panwar et al., 2014), based on random forests, and nRC (Fiannaca et al., 2017), based on convolutional neural networks (CNNs), where the secondary structure is used for classification; then, more recently, ncRDeep (Chantsalnyam et al., 2020), based on CNNs, and ncRFP (Wang et al., 2020), based on recurrent neural networks (RNNs), both considering only sequence characteristics. Very few methods are specifically interested in the classification of lncRNAs. For example, SEEKR (Kirk et al., 2018) uses the sequence, more precisely k-mer profiles, to group the most similar transcripts into a functional class, using a clustering algorithm based on Pearson correlation (unsupervised learning). LncADeep (Yang et al., 2018) uses a deep neural network (DNN) to identify interactions between lncRNAs and proteins, based on sequence and secondary structure. The tool then uses the annotation of the proteins associated with a lncRNA to describe the biological functions in which it is potentially involved. Although these methods make it possible to determine the broad category of a lncRNA, they remain limited. In addition, the classes summarized in (St. Laurent et al., 2015) are not all identified by existing tools. We believe it might be possible to classify lncRNAs more finely by taking into account other characteristics.
In this project, we propose to develop original computational methods based on deep learning (DL) to predict, classify and identify the function of ncRNAs, including lncRNAs, by integrating different characteristics: sequence, structure (especially secondary), genomic and chromosomal position, interaction with coding or non-coding genes, and genetic and epigenetic alterations. Two methodological challenges are to be addressed: (i) taking heterogeneous characteristics into account (multi-source approach); (ii) predicting known classes of ncRNAs while being able to predict new classes, by combining supervised and unsupervised approaches. An important additional point concerns the visualization of the results, for better understanding and interpretation by the user.
Subject:
We are interested in self-organizing maps (SOMs), unsupervised neural networks capable of grouping and visualizing large-scale data. Using an unsupervised competitive learning algorithm, this technique produces a map representing the input space, in which nearby inputs are mapped to nearby regions of the map. In order to represent heterogeneous sources, we will propose original multimodal approaches based on DL that merge the different data sources. Fusion can be performed using three main strategies (Ramachandram and Taylor, 2017): early fusion, joint fusion and late fusion. Early fusion combines the input features of the different sources before using a single DL model. Joint fusion refers to combining representations of the inputs learned at intermediate layers of the different neural networks that process the modalities. Late fusion combines the decisions of several neural networks, each processing one modality, to provide a final decision. We will be particularly interested in joint fusion for the classification of ncRNAs and the identification of their biological functions. To take the different heterogeneous sources into account, each data source will be processed by an adequate DL model, such as convolutional neural networks (CNNs), graph neural networks (GNNs) or multi-layer perceptrons (MLPs), allowing better extraction of high-level features from that source. To allow the discovery of new classes, we will study the association of different rejection options (Geifman and El-Yaniv, 2019) with the multimodal model. The combination of this model with SOMs (Platon et al., 2018) will allow the visualization of new classes of ncRNAs. We will also be interested in identifying the data sources and characteristics that led to the predictions (Platon et al., 2018bis). This will make it possible to explain the predictions and to discover new properties that could be associated with ncRNAs.
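A minimal NumPy sketch of a SOM, assuming toy 2D inputs in place of real ncRNA features; the map size, decay schedules and quantization-error check are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated 2D clusters as toy "feature vectors".
data = np.vstack([
    rng.normal(0.2, 0.05, (100, 2)),
    rng.normal(0.8, 0.05, (100, 2)),
])

grid = 5                                       # 5x5 map
weights = rng.random((grid * grid, 2))
coords = np.array([(i, j) for i in range(grid) for j in range(grid)], float)

def quantization_error(w, x):
    d = np.linalg.norm(x[:, None, :] - w[None, :, :], axis=2)
    return d.min(axis=1).mean()

qe_before = quantization_error(weights, data)

n_iter = 2000
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    lr = 0.5 * (0.01 / 0.5) ** (t / n_iter)        # decaying learning rate
    sigma = 2.0 * (0.3 / 2.0) ** (t / n_iter)      # shrinking neighborhood radius
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # best-matching unit
    h = np.exp(-np.sum((coords - coords[bmu]) ** 2, axis=1) / (2 * sigma**2))
    weights += lr * h[:, None] * (x - weights)     # pull neighborhood toward x

qe_after = quantization_error(weights, data)
print(round(qe_before, 3), round(qe_after, 3))
```

After training, the two clusters activate different regions of the map, which is the property the project would exploit to visualize (and potentially reject into) new ncRNA classes.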
Expected results
Application, objectives and interest for cancer research
The deregulation of ncRNAs may participate in tumor progression but the ncRNAs involved and their roles remain poorly defined. Clinically, ncRNAs can be diagnostic markers and therapeutic targets (Roberts et al., 2020).
Cancer in a given tissue is a heterogeneous disease composed of several subtypes, each defined by a specific transcriptional program. Genetic and epigenetic alterations, as well as the genes involved in cancer, must be studied taking these subtypes into account. In this project we will focus on bladder cancers (Tran et al., 2021), and in particular on the luminal papillary and basal subtypes, to which the partner team has been a major contributor. Luminal papillary cancers are well differentiated and present, in the majority of cases, activating genetic alterations of the gene coding for the receptor tyrosine kinase FGFR3 and activation of the nuclear receptor PPARG (Biton et al., 2014; Mahé et al., 2018; Rochel et al., 2019; Shi et al., 2020). The basal subtype is particularly aggressive (most deaths occur within one year after diagnosis), poorly differentiated, and found not only in bladder cancers but in many other carcinomas (breast, pancreatic and lung cancers, for example) (Rebouissou et al., 2014; Kamoun et al., 2019).
The goals in the project are:
1) To systematically identify and classify the ncRNAs present in tumors of the papillary luminal and basal subtypes and compare them to the ncRNAs present in normal urothelium in different physiological states: different stages of differentiation, development or healing. This will allow us to accurately compare tumor cells to normal cells.
2) To identify deregulated ncRNAs in tumor cells compared to normal cells.
3) To determine the genetic or epigenetic mechanisms of their deregulation.
4) To predict the biological functions in which these ncRNAs are involved and their roles (target genes in the case of small ncRNAs, sponge function, participation in complexes and/or regulation of transcription in the case of long ncRNAs).
For this purpose, we will use the computational tools developed in this project, other RNA bioinformatics tools developed in the AROBAS team and available on the EvryRNA platform (http://EvryRNA.ibisc.univ-evry.fr), such as RNANet (Becquey et al., 2020), BiORSEO (Becquey et al., 2020), RCPred (Legendre et al., 2019), IRSOM (Platon et al., 2018), miRNAFold (Tav et al., 2016) and miRBoost (Tran et al., 2015), as well as tools proposed in the literature.
The project will initially take advantage of molecular data already available (acquired by the partner team or publicly available): transcriptomics on whole samples and on single cells, genomic alterations, mutations, DNA methylation, histone modifications, DNA accessibility (ATAC-seq) and DNA conformation (Hi-C). These data will be associated with clinical and pathological data. During the project, additional data will be acquired to complement the single-cell analyses by taking into account the spatial organization of the tumor (spatial transcriptomics).
The proposed work will advance our knowledge of bladder cancer, a cancer for which therapeutic options for the most aggressive forms remain limited. Since the studied subtypes are found in other cancers, the biological results obtained will have a general scope in oncology. RNAs are promising therapeutic targets; they are also diagnostic markers that can be very specific. Beyond the development of original deep learning algorithms and original bioinformatics methods dedicated to RNAs, this thesis project therefore aims to use the methods we develop to help analyze and understand a health issue, bladder cancer, towards a better therapeutic response.
References:
-Anastasiadou E, Jacob LS, Slack FJ. Non-coding RNA networks in cancer. Nat Rev Cancer. 2018 Jan;18(1):5-18. doi: 10.1038/nrc.2017.99. Epub 2017 Nov 24. PMID: 29170536.
-Becquey L, Angel E, Tahi F. RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures. Bioinformatics. 2020 Nov 2:btaa944. doi: 10.1093/bioinformatics/btaa944.
-Becquey L, Angel E, Tahi F. BiORSEO: a bi-objective method to predict RNA secondary structures with pseudoknots using RNA 3D modules. Bioinformatics. 2020 Apr 15;36(8):2451-2457. doi: 10.1093/bioinformatics/btz962. PMID: 31913439.
-Biton A, Bernard-Pierrot I, Lou Y, Krucker C, Chapeaublanc E, Rubio-Pérez C, López-Bigas N, Kamoun A, Neuzillet Y, Gestraud P, Grieco L, Rebouissou S, de Reyniès A, Benhamou S, Lebret T, Southgate J, Barillot E, Allory Y, Zinovyev A, Radvanyi F. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep. 2014 Nov 20;9(4):1235-45. doi: 10.1016/j.celrep.2014.10.035. Epub 2014 Nov 13. PMID: 25456126.
-Chantsalnyam, T., Lim, D. Y., Tayara, H., & Chong, K. T. (2020). ncRDeep: Non-coding RNA classification with convolutional neural network. In Computational Biology and Chemistry (Vol. 88). Elsevier Ltd. https://doi.org/10.1016/j.compbiolchem.2020.107364
-Fiannaca, A., La Rosa, M., La Paglia, L., Rizzo, R., Urso, A. (2017). NRC: Non-coding RNA Classifier based on structural features. BioData Mining, 10(1), 27. https://doi.org/10.1186/s13040-017-0148-2
-Geifman, Y. & El-Yaniv, R. (2019). SelectiveNet: A Deep Neural Network with an Integrated Reject Option. Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2151-2159.
-Kamoun A, de Reyniès A, Allory Y, Sjödahl G, Robertson AG, Seiler R, Hoadley KA, Groeneveld CS, Al-Ahmadie H, Choi W, Castro MAA, Fontugne J, Eriksson P, Mo Q, Kardos J, Zlotta A, Hartmann A, Dinney CP, Bellmunt J, Powles T, Malats N, Chan KS, Kim WY, McConkey DJ, Black PC, Dyrskjøt L, Höglund M, Lerner SP, Real FX, Radvanyi F; Bladder Cancer Molecular Taxonomy Group. A Consensus Molecular Classification of Muscle-invasive Bladder Cancer. Eur Urol. 2020 Apr;77(4):420-433. doi: 10.1016/j.eururo.2019.09.006. Epub 2019 Sep 26. PMID: 31563503.
-Kirk, J. M., Kim, S. O., Inoue, K., Smola, M. J., Lee, D. M., Schertzer, M. D., Wooten, J. S., Baker, A. R., Sprague, D., Collins, D. W., Horning, C. R., Wang, S., Chen, Q., Weeks, K. M., Mucha, P. J., Calabrese, J. M. (2018). Functional classification of long non-coding RNAs by k-mer content. Nature Genetics, 50(10), 1474–1482. https://doi.org/10.1038/s41588-018-0207-8
-Kopp, F., & Mendell, J. T. (2018). Functional Classification and Experimental Dissection of Long Noncoding RNAs. In Cell (Vol. 172, Issue 3, pp. 393–407). https://doi.org/10.1016/j.cell.2018.01.011
-Legendre A, Angel E, Tahi F. RCPred: RNA complex prediction as a constrained maximum weight clique problem. BMC Bioinformatics. 2019 Mar 29;20(Suppl 3):128. doi: 10.1186/s12859-019-2648-1
-Mahé M, Dufour F, Neyret-Kahn H, Moreno-Vega A, Beraud C, Shi M, Hamaidi I, Sanchez-Quiles V, Krucker C, Dorland-Galliot M, Chapeaublanc E, Nicolle R, Lang H, Pouponnot C, Massfelder T, Radvanyi F, Bernard-Pierrot I. An FGFR3/MYC positive feedback loop provides new opportunities for targeted therapies in bladder cancers. EMBO Mol Med. 2018 Apr;10(4):e8163. doi: 10.15252/emmm.201708163. PMID: 29463565.
-Pachera E, Assassi S, Salazar GA, Stellato M, Renoux F, Wunderlin A, Blyszczuk P, Lafyatis R, Kurreeman F, de Vries-Bouwstra J, Messemaker T, Feghali-Bostwick CA, Rogler G, van Haaften WT, Dijkstra G, Oakley F, Calcagni M, Schniering J, Maurer B, Distler JH, Kania G, Frank-Bertoncelj M, Distler O. Long noncoding RNA H19X is a key mediator of TGF-β-driven fibrosis. J Clin Invest. 2020 Sep 1;130(9):4888-4905. doi: 10.1172/JCI135439. PMID: 32603313.
-Panwar, B., Arora, A., & Raghava, G. P. S. (2014). Prediction and classification of ncRNAs using structural information. BMC Genomics, 15(1), 127. https://doi.org/10.1186/1471-2164-15-127
-Platon L, Zehraoui F, Bendahmane A, Tahi F. IRSOM, a reliable identifier of ncRNAs based on supervised self-organizing maps with rejection. Bioinformatics. 2018 Sep 1;34(17):i620-i628. doi: 10.1093/bioinformatics/bty572.
-Platon L, Zehraoui F, Tahi F. Localized Multiple Sources Self-Organizing Map. ICONIP (3) 2018: 648-659.
-Ramachandram D. and Taylor, G. W. ‘Deep Multimodal Learning: A Survey on Recent Advances and Trends,’ in IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 96-108, Nov. 2017, doi: 10.1109/MSP.2017.2738401.
-Rebouissou S, Bernard-Pierrot I, de Reyniès A, Lepage ML, Krucker C, Chapeaublanc E, Hérault A, Kamoun A, Caillault A, Letouzé E, Elarouci N, Neuzillet Y, Denoux Y, Molinié V, Vordos D, Laplanche A, Maillé P, Soyeux P, Ofualuka K, Reyal F, Biton A, Sibony M, Paoletti X, Southgate J, Benhamou S, Lebret T, Allory Y, Radvanyi F. EGFR as a potential therapeutic target for a subset of muscle-invasive bladder cancers presenting a basal-like phenotype. Sci Transl Med. 2014 Jul 9;6(244):244ra91. doi: 10.1126/scitranslmed.3008970. PMID: 25009231.
-Roberts TC, Langer R, Wood MJA. Advances in oligonucleotide drug delivery. Nat Rev Drug Discov. 2020 Oct;19(10):673-694. doi: 10.1038/s41573-020-0075-7. Epub 2020 Aug 11. PMID: 32782413
-Rochel N., Krucker C.*, Coutos-Thevenot L.*, Osz J., Zhang R., Guyon E., Zita W., Vanthong S., Alba Hernandez O., Bourguet M., Al Badawy K., Dufour F., Peluso-Iltis C., Heckler-Beji S., Dejaegere A., Kamoun A., de Reyniès A., Neuzillet Y., Rebouissou S., Béraud C., Lang H., Massfelder T., Allory Y., Cianférani S., Stote R.H., Radvanyi F., Bernard-Pierrot I. (2019). Recurrent activating mutations of PPARγ associated with luminal bladder tumors. Nat. Commun. 10, 253.
-Shi MJ, Meng XY, Fontugne J, Chen CL, Radvanyi F, Bernard-Pierrot I. Identification of new driver and passenger mutations within APOBEC-induced hotspot mutations in bladder cancer. Genome Med. 2020 Sep 28;12(1):85. doi: 10.1186/s13073-020-00781-y. PMID: 32988402.
-St.Laurent, G., Wahlestedt, C., & Kapranov, P. (2015). The Landscape of long noncoding RNA classification. In Trends in Genetics (Vol. 31, Issue 5, pp. 239–251). Elsevier Ltd. https://doi.org/10.1016/j.tig.2015.03.007
-C. Tav, S. Tempel, L. Poligny, Tahi F. miRNAFold: a web server for fast miRNA precursor prediction in genomes. Nucleic Acids Res. 2016 Jul 8;44(W1):W181-4.
-VD Tran, S. Tempel, B. Zerath, F. Zehraoui, Tahi F. miRBoost: boosting support vector machines for microRNA precursor classification. RNA. 2015;21(5).
-Tran L, Xiao JF, Agarwal N, Duex JE, Theodorescu D. Advances in bladder cancer biology and therapy. Nat Rev Cancer. 2021 Feb;21(2):104-121. doi: 10.1038/s41568-020-00313-1. Epub 2020 Dec 2. PMID: 33268841.
-Uroda, T., Anastasakou, E., Rossi, A., Teulon, J. M., Pellequer, J. L., Annibale, P., Pessey, O., Inga, A., Chillón, I., Marcia, M. (2019). Conserved Pseudoknots in lncRNA MEG3 Are Essential for Stimulation of the p53 Pathway. Molecular Cell, 75(5), 982-995.e9. doi.org/10.1016/j.molcel.2019.07.025
-Wang, L., Zheng, S., Zhang, H., Qiu, Z., Zhong, X., Liu, H., Liu, Y. (2020). ncRFP: A novel end-to-end method for non-coding RNAs family prediction based on Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1–1. https://doi.org/10.1109/tcbb.2020.2982873
-Yang, C., Yang, L., Zhou, M., Xie, H., Zhang, C., Wang, M. D., Zhu, H. (2018). LncADeep: An ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics, 34(22), 3825–3834. https://doi.org/10.1093/bioinformatics/bty428.
Candidate profile:
Candidates with an M2 (or equivalent) in bioinformatics, computer science or data science.
Required training and skills:
Knowledge of biology will make it easier to interact with the biologists involved in the project; some of this knowledge can also be acquired during the research work. A strong capacity for adaptation (new methods, new topics) and a desire to interact with people from different specialties are required.
Job address:
Bâtiment IBGBI, 23 bd. de France, 91000 Evry.
Offer related to an Action/Network: – — –/– — –
Laboratory/Company: SIMAP / LIG
Duration: 3 years
Contact: Emilie.Devijver@univ-grenoble-alpes.fr
Publication deadline: 2021-05-19
Context:
Addressing climate change is among the most urgent concerns in international policy. In this respect, efficient carbon capture has been proposed as a means of enabling the continued use of fossil fuels in the near term, while renewable energy sources gradually replace the existing infrastructure. Metal-organic frameworks (MOFs) are three-dimensional porous materials that have recently attracted much attention as promising candidates for efficient carbon capture. The goal of this project is to computationally design optimal MOFs for energy-efficient carbon capture and release (CCR). Specifically, an efficient CCR mechanism will be achieved by exploiting a change in the affinity for the gas (and thus in its uptake) upon an electronic transition induced by external stimuli.
Subject:
A method combining machine learning and electronic structure simulations (DFT, many-body perturbation theory or quantum chemistry methods) will be developed, tested and employed to provide a first candidate set of optimal materials. A second step will further tune and improve the properties of these materials for the desired application. Finally, in order to compare with existing well-performing MOFs, adsorption properties such as the working capacity will be computed for the best performers selected in the previous steps. Regarding the machine learning approach, the challenge is to develop a robust ML model that provides highly predictive structure–property relationships from a small training set of high-quality electronic structure simulations. The model will be developed on small molecules and then tested and applied to databases of existing MOFs.
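As a deliberately minimal sketch of fitting a structure–property relationship from few training points (one hypothetical scalar descriptor per material; the actual project would use richer descriptors and kernel or neural models):

```python
def ridge_fit_1d(x, y, lam=0.1):
    """Closed-form ridge regression y ≈ w*x + b for one descriptor.

    The L2 penalty `lam` keeps the fit stable when the training set
    is small, which is the regime targeted by the project.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    w = sxy / (sxx + lam)
    return w, my - w * mx

# toy data: descriptor values vs. target property (values are made up)
w, b = ridge_fit_1d([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
predict = lambda x: w * x + b
```

The fitted model can then score candidate materials cheaply before the best ones are re-examined with full electronic structure simulations.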
Candidate profile:
We are looking for highly motivated candidates with a Master's degree in machine learning or physics (or equivalent). A good command of written and spoken English is essential for communicating with our external collaborators (US, Spain). The candidate should have programming skills (Fortran, C/C++, Python) and experience with Linux. Basic knowledge of parallel computing will be appreciated.
Required training and skills:
The deadline for applications is May 20th; interviews will be conducted before the end of May.
Applications should include: a concise but informative cover letter, a CV, Master 1 and Master 2 (or equivalent) transcripts, and the names and contact details of at least two referees who can be contacted for recommendation letters.
Job address:
The PhD student will be located at the SIMaP laboratory in Grenoble. SIMaP (https://simap.grenoble-inp.fr/) hosts scientists from different disciplines working on materials science using both experiments and simulations. The PhD is part of the Multidisciplinary Institute in Artificial Intelligence MIAI (https://miai.univ-grenoble-alpes.fr/en/multidisciplinary-institute-in-artificial-intelligence-academic-year-2020-2021-en-799001.htm). Two supervisors are located at SIMaP and a third one, Emilie Devijver, at LIG (https://www.liglab.fr/), the computer science laboratory of Grenoble, located close to SIMaP on the university campus. Grenoble, the capital of the Alps, offers an international and stimulating environment for both leisure (mountain sports) and science. Regular seminars are organized by MIAI, SPF38, and other research centers such as ESRF and ILL.
Offer related to an Action/Network: – — –/– — –
Laboratory/Company: LIFO and GREYC/LS2N
Duration: 3 years
Contact: thi-bich-hanh.dao@univ-orleans.fr
Publication deadline: 2021-05-21
Context:
Clustering is an important task in data mining that aims at partitioning data instances into groups to uncover the underlying structure of the data. Clustering has been extended to constrained clustering, which integrates prior expert knowledge in the form of constraints in order to make the clustering task more accurate [1]. Most constrained clustering methods require all constraints to be specified before the method is run. It is however crucial that the expert can interact with the clustering process and inject new information or knowledge, in the form of constraints, on a clustering result. Constraints can be pairwise must-link or cannot-link constraints, stating that two instances must be or cannot be in the same cluster; constraints on the clusters, bounding their size or diameter; or operations on clusters, such as splitting a cluster or merging two clusters. The constrained clustering process therefore becomes incremental and interactive. However, these two properties are not well supported by existing constrained clustering approaches. This thesis aims to investigate these two research directions.
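The pairwise constraints described above can be sketched as a simple satisfaction check (an illustrative helper, not code from the project):

```python
def violated_constraints(labels, must_link, cannot_link):
    """List the pairwise constraints that a clustering violates.

    `labels[i]` is the cluster of instance i; constraints are pairs
    (i, j) of instance indices.
    """
    bad_ml = [(i, j) for i, j in must_link if labels[i] != labels[j]]
    bad_cl = [(i, j) for i, j in cannot_link if labels[i] == labels[j]]
    return bad_ml, bad_cl

labels = [0, 0, 1, 1, 2]
ml, cl = violated_constraints(labels, [(0, 2)], [(3, 2)])
print(ml, cl)  # [(0, 2)] [(3, 2)]
```

An incremental method would repair such violations when new constraints arrive, while changing the partition as little as possible.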
Subject:
The thesis will be organized into two complementary parts. The objective of the first part is to propose new clustering approaches that incrementally integrate new constraints identified as important by the expert or by measures such as [6]. Declarative approaches based on Constraint Programming (CP) or Integer Linear Programming (ILP) will be considered for their expressiveness and their rich constraint languages. At the same time, in order not to confuse the expert, the new partition should not differ too much from the previous one. This can be guaranteed using a measure of clustering similarity, which can be either statistical [8] or more explanatory [5].
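As one concrete statistical similarity measure, the Rand index (the fraction of instance pairs on which two partitions agree; the thesis may of course rely on other measures) can be sketched as:

```python
from itertools import combinations

def rand_index(a, b):
    """Rand index between two clusterings of the same instances."""
    pairs = list(combinations(range(len(a)), 2))
    # a pair agrees if both partitions group (or both separate) it
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))  # 3 of 6 pairs agree: 0.5
```

Bounding such a score between the old and the new partition is one way to keep successive solutions close to each other.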
In the second part, we will consider a more user-centered and interactive clustering approach. This new paradigm stresses that users should quickly be presented with newly generated constraints likely to be of interest to them (i.e., which may improve clustering quality in later iterations), on which they give feedback. This feedback could take the form of validating or invalidating constraints. In the mono-clustering setting, these constraints can be generated from information on an existing partition, to identify informative points (e.g., frontier points). We will also consider the case where a set of clusterings is available, as in collaborative and multi-paradigm clustering [2,7]. In such settings, one can use information from the different clusterings, for instance to identify uncertain pairs or to elicit the best objective functions according to criteria to be defined. Here, a pair is more uncertain when more clusterings disagree on whether its two instances should be in the same cluster. Another approach is to exploit the history of feedback to determine the most informative points [3]. Meanwhile, to prevent contradictions while collecting user feedback, the consistency of the learned constraints must be ensured.
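The pair-uncertainty idea can be sketched as follows (an illustrative scoring over label vectors for the same instances; the actual criteria are part of the thesis work):

```python
from itertools import combinations

def pair_uncertainty(clusterings):
    """Score instance pairs by how much the clusterings disagree.

    Each clustering is a label vector over the same n instances.
    A pair scores 1 when the clusterings split evenly on whether
    its two instances belong together, and 0 when they all agree.
    """
    n = len(clusterings[0])
    scores = {}
    for i, j in combinations(range(n), 2):
        frac = sum(c[i] == c[j] for c in clusterings) / len(clusterings)
        scores[(i, j)] = 1 - abs(2 * frac - 1)
    return scores

clusterings = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
scores = pair_uncertainty(clusterings)
print(scores[(0, 3)])  # all three clusterings separate 0 and 3: 0.0
```

Pairs with high scores would be natural candidates to present to the user as must-link/cannot-link questions.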
The proposed method will be generic and will not depend on the target application areas. As part of the HERELLES project, and in order to validate the operability of the method, we will focus on understanding complex phenomena in our environment (soil artificialization, urbanization, infrastructure construction, etc.), mainly via heterogeneous temporal data.
Candidate profile:
Machine learning, data mining, constraint programming and applied mathematics.
Required training and skills:
Master's or engineering school degree.
Job address:
The PhD position will be conducted at LIFO, University of Orléans in collaboration with GREYC/LS2N, University of Caen Normandy.
The complete application consists of the documents below, which should be sent as a single PDF file to:
Thi-Bich-Hanh Dao (thi-bich-hanh.dao@univ-orleans.fr) LIFO, University of Orleans
Samir Loudni (samir.loudni@imt-atlantique.fr) IMT Atlantique – CNRS – LS2N
● Detailed CV
● One-page cover letter (clearly indicating available starting date as well as relevant qualifications, experience and motivation)
● University certificates and transcripts (both B.Sc and M.Sc degrees marks)
● Contact details of up to three referees
● Possibly an English language certificate and a list of publications
● Attention: all documents should be in English or in French.
Attached document: 202104230758_PhD subject Orleans 2021.pdf
