Research engineer – deployment of AI tools on a cluster (M/F)

Related Action/Network: –/–

Laboratory/Company: CRIStAL – UMR9189
Duration: 24 months
Contact: clarisse.dhaenens@univ-lille.fr
Publication deadline: 2023-07-21

Context:
Working with the engineers of the CRI, the engineer will be tasked with broadening access to the Big Data cluster acquired under the CPER Data programme.
To that end, he or she will:
– meet scientists who are not AI specialists to understand their needs
– carry out technology watch on the software libraries in use
– deploy AI tools (initially drawn from available tools, and later from WP1)
– provide an interactive programming environment connected to part of the infrastructure.

Subject:
Deployment of AI tools on computing clusters for use by non-specialists.
Two stages:
1. pre-identification of collaborative projects involving researchers in digital sciences and researchers from other disciplines
2. provision of AI tools for non-specialists

https://emploi.cnrs.fr/Offres/CDD/UMR9189-CLADHA-002/Default.aspx

Candidate profile:
Advanced knowledge of software development (C, C++, Python, JavaScript).
Knowledge of artificial intelligence, in particular the use of AI tools.
Knowledge of distributed and multi-core programming.

Education and required skills:
Level 7 (Bac+5, i.e. master's level, and above)

Work address:
VILLENEUVE D’ASCQ

CFP: Workshop on Artificial Intelligence for Predictive Maintenance and IIoT (AI4PMI) @IEEE AICCSA Egypt 4-7 Dec. 2023

Date: 2023-12-04 to 2023-12-07
Location: Cairo, Egypt

[apologies if you receive multiple copies of this CFP]

Dear Colleagues,

Please find below the call for papers for the Workshop on Artificial Intelligence for Predictive Maintenance and IIoT (AI4PMI), co-located with the IEEE AICCSA 2023 conference, to be held at the National Telecommunication Institute – Smart Village, Giza, Egypt, 4–7 December 2023.

https://aiccsa-wsai4pmi1.gitlab.io/website/

Deadline for submissions: July 15, 2023 (AoE)

We are facing the 4th Industrial Revolution, which revolves around IoT, edge devices and machine learning applications. While IoT is now part of our daily environment, these paradigms, taken together, open the door to new possibilities for predictive maintenance by enabling the edge to “talk” and send real-time data.

Since predictive maintenance is aimed at finding the right balance between scheduled maintenance and curative maintenance, it requires the use of machine learning (ML) based solutions to explore and exploit the data generated. Innovative solutions are required which go beyond the current predictive maintenance systems by exploiting Artificial Intelligence techniques. This need to go beyond can be seen in the case of supervision systems where every new failure risk may not be predictable beforehand but with the use of machine learning, the decision process can be made more reactive to failures and more robust against attacks.

As this research area is still new, many scientific barriers need to be overcome and different challenges need to be addressed ranging from the data acquisition to the type of machine learning solution applied. Therefore, this workshop aims to bring together researchers, practitioners, and industry experts to discuss and explore the latest developments, methodologies, and applications of Artificial Intelligence techniques in predictive maintenance and IIoT. The primary goals of the workshop are to foster collaboration, exchange ideas, and promote advancements in this rapidly evolving field.

Topics include, but are not limited to:

– Data: acquisition & preprocessing, sensor fusion & data integration, benchmarks & datasets, simulations & digital twins
– Features: extraction, selection
– Targets: anomaly detection, fault diagnosis, fault prediction, root cause analysis, recovery protocols design, data privacy protection, knowledge capitalization
– Methods: deep learning, generative methods, explainability & interpretability, transfer learning, domain adaptation, real time algorithms, optimization, evolutionary algorithms, open-world machine learning, continual learning, symbolic AI, graph-based architectures (knowledge graphs, Graph neural networks, …)
– Edge computing: tiny ML, distributed architectures (federated learning, distributed learning, multi-agent systems)

These topics provide a comprehensive coverage of the technical challenges and advancements in machine learning for predictive maintenance and IIoT. They offer opportunities for researchers and practitioners to discuss their work, share insights, and collaborate on solving real-world maintenance problems.

Submission procedure

Please see this page for the submission instructions.
https://aiccsa-wsai4pmi1.gitlab.io/website/

Organizing Committee :

– Guillaume MULLER, École des Mines de Saint-Étienne, France
– Anaïs Lavorel, Université Claude Bernard Lyon 1, France
– Kamal Singh, Université de Saint-Étienne, France


Our website: www.madics.fr
Follow us on Twitter: @GDR_MADICS
To unsubscribe from the list, follow this link.

PhD offer within the MOCKUP project: Meteorological Observation ontologies and Contextual Knowledge for final User Policies

Related Action/Network: –/–

Laboratory/Company: IRIT/CRNM
Duration: 3 years
Contact: cassia.trojahn@irit.fr
Publication deadline: 2023-07-01

Context:
Sustainable development is deeply embedded in public policy. One consequence is the search for technical or scientific tools to achieve this goal, which is known as sustainability science. This new field of science is highly interdisciplinary, with interaction between the social and human sciences and the natural or formal sciences. In this context, interdisciplinary data science initiatives on steering regional planning deserve to be developed. This steering is guided by data from a variety of domains (geography, economics, environment, etc.). Viewpoints on these data are contextual and evolving.

The MOCKUP project therefore focuses on observing and steering the territory, relying on a semantic representation of incoming data and of viewpoints on these data. Its objectives are:
• learning viewpoints according to how data are used, in particular environmental data;
• representing viewpoints in order to define dynamic ontologies that adapt to uses and context;
• contextual reasoning on the data for decision support.

Subject:
In this project, our hypothesis is that the appropriation of data described by an ontology, called the reference ontology, requires taking users (and uses) into account, both target audiences and contexts of use, in the way the ontology is presented. To this end, we propose to define the notion of viewpoint, understood as a prism: a way of presenting the ontology, in part or in full, that is adapted to users and to a context of use.

To adapt to the context of use, we want to give the ontology a dynamic, adaptive and contextual character, which is neglected in state-of-the-art work on ontology construction. Although ontology evolution, which gives ontologies a minimum of “dynamic” character, has been the subject of much research in the Semantic Web community, the problem addressed here is different and new: the ontology would be stable and rich enough to be adapted through new instantiations giving rise to different views of it, depending on uses and contexts.

The work will involve reformulating, simplifying or extracting subsets of the ontology and, where necessary, considering adapted presentations, all in a manner suited to the contexts of use. These adaptive ontologies can thus serve as the cornerstone for reasoning that depends on the context of use.

Candidate profile:
Master's degree in computer science (preferably with honours), with training in knowledge representation and Semantic Web technologies. Programming skills and good writing skills, including in English.

Education and required skills:
Master's degree in computer science (preferably with honours), with training in knowledge representation and Semantic Web technologies. Programming skills and good writing skills, including in English.

Work address:
The doctoral student will receive an interdisciplinary doctoral grant (ADI) co-funded by the Université de Toulouse and the Occitanie region (starting October 2023, for 3 years). The thesis will be co-supervised by Cassia Trojahn (IRIT, UT2J), Christophe Baehr (CRNM) and Nathalie Aussenac-Gilles (CNRS/IRIT).

Postdoctoral position: Machine learning for time series prediction in environmental sciences

Related Action/Network: –/–

Laboratory/Company: LIFAT (EA 6300), Université de Tours
Duration: 18 months
Contact: nicolas.ragot@univ-tours.fr
Publication deadline: 2023-09-30

Context:
The JUNON project, led by BRGM, is funded by the Centre-Val de Loire region through the ARD programme (“Ambition Recherche Développement”), whose goal is to develop a research and innovation hub around environmental resources (agriculture, forests, water...). The main goal of JUNON is to build large-scale digital twins to improve the monitoring, understanding and prediction of the evolution of environmental resources and related phenomena, for better management of natural resources.
JUNON will focus on digital twins covering the quality and quantity of groundwater, as well as emissions of greenhouse gases and pollutants with health effects, at the scale of a geographical area corresponding to the northern part of the Centre-Val de Loire region.
The project partners are BRGM, Université d’Orléans, Université de Tours, CNRS, INRAE, and the companies ATOS and ANTEA.

Subject:
While BRGM will be in charge of collecting and organizing the data (groundwater levels at different locations) and of benchmarking predictions against mechanistic models and classical AI prediction tools, the goal of the postdoc will be to build new prediction models able to integrate several sources of information, such as:
– meteorological data
– spatial information, i.e. geolocation of sensors and of the locations where predictions are to be made; topological information such as altitude
– knowledge from mechanistic models and from experts (impact of the attributes and variables used)
– etc.
The scientific challenges clearly relate to:
– multivariate time series
– short-term to long-term predictions (horizon)
– moving from local predictors to “connected predictors”, i.e. how to use information coming from sensors spread over the study area
And if possible:
– handling heterogeneous data (time series, climatic data, topological information, combination with other models...)
– outlining how continual learning (the subject of a PhD) could be applied to such models.
Transformers and Spatio-Temporal Graph Neural Networks will be investigated in particular.
Models will have to be implemented, trained and compared with classical models on benchmarks.

Candidate profile:
The candidate should be experienced in machine learning (Python) and time series.

Education and required skills:
PhD in machine learning (computer science or applied mathematics)
Skills:
– strong experience in data analysis and machine learning (theory and practice of deep learning in Python) is required
– experience or knowledge in time series prediction and environmental science is welcome
– curiosity and ability to communicate (in English at least) and to collaborate with scientists from other fields
– ability to propose and validate new solutions and to publish the results
– autonomy and good organizational skills

Work address:
Affiliation: Computer Science Lab of Université de Tours (LIFAT), Pattern Recognition and Image Analysis Group (RFAI)

Attached document: 202307171132_Fiche de poste Pdoc Junon.pdf

Fixed-term engineer position (CDD)

Related Action/Network: FedSed/Innovation

Laboratory/Company: EPIONE Inria
Duration: 24 months
Contact: marco.lorenzi@inria.fr
Publication deadline: 2023-09-30

Context:
As a research engineer, you will be attached to the Department of Experimentation and Development (SED) and to the EPIONE research group. You will join the staff in charge of engineering innovation and Inria technology transfer, and work on the development of the Fed-BioMed project. This project is the result of the synergy between research and engineering teams for the development of an open framework for federated learning in healthcare. The project is run in collaboration with an international network of clinical collaborators and hospitals, which will provide clinical use cases for translating the development work into the real world. The project is based on Agile development methods and is supported by the transfer, innovation and partnerships service (STIP) of the center, which is responsible for prospecting and for helping the team implement research contracts and technology transfer.

At the end of this experience, you will have consolidated a broad range of skills in software engineering applied to a unique scientific context, through the deployment of a state-of-the-art data science paradigm. This experience will allow you to consider careers as a research and development engineer in national organizations (Inria, INRAE, CNRS, CEA), industrial research centers, SMEs and digital start-ups.

Subject:
The engineer will be mainly involved in software development for the Fed-BioMed project. She/he will also participate in the design and development of highly innovative software platforms for scientific research and experimentation.
The engineer will also work on the co-development of prototype software integrating the core technological ideas proposed by the researchers. The engineer will work within a software development team using agile methodologies (a mixed Scrum/Extreme Programming method), in a context where she/he will be able to constantly improve her/his skills.

Candidate profile:
This position is intended for PhD holders, postdocs or engineers in computer science (IT, image processing, robotics, bioinformatics, automation, simulation and high-performance computing), with a demonstrable background in machine learning and AI-related topics.

• Software development / software engineering
• Solid experience in software development
• Experience in project development, preferably using Agile methodology
• Experience in Linux development
• Training and/or experience in one or more of the following fields:
o data science / statistics, machine learning, artificial intelligence, optimization
o ML libraries (e.g. PyTorch, TensorFlow, Keras, Julia)
o medical data analysis
o database management
• Knowledge of or experience in an R&D environment (public or private sector)

Education and required skills:
Skills / know-how:
• Programming languages: Python, C, C++
• Experience with machine learning libraries (e.g. PyTorch, TensorFlow, Keras, Julia)
• Ability to apply the methods and tools for compilation, version control, continuous integration and test-driven development in an agile context
• Knowledge of agile methodology
• Good writing and communication skills
• Good level of technical and scientific English, both oral and written
• Knowledge of one or more of the following tools is also a plus:
o version management, continuous integration, packaging and deployment (git, jenkins, cpack, conda, docker)
o graphical interfaces: Qt, Electron, Gtk, ...

Benefits

• Subsidized canteen,
• public transport partially reimbursed
• Leave: 7 weeks of annual leave + 10 days of RTT (reduced working time days, full-time basis) + possibility of exceptional leave of absence (e.g. parental leave, moving)
• Possibility of teleworking (after 6 months of seniority) and working time arrangement
• Professional equipment available (videoconferencing, loan of equipment, computers, etc.)
• Social, cultural and sports benefits (Association for the management of social works Inria)
• Access to vocational training
• Social Security

Salary

Depending on experience

Work address:
Inria Sophia Antipolis
2004 Route des Lucioles,
06902 Sophia Antipolis

Interdisciplinary PhD offer IRIT/CLLE (Toulouse)

Related Action/Network: –/Doctorants

Laboratory/Company: IRIT/CLLE (Toulouse)
Duration: 3 years
Contact: cassia.trojahn@irit.fr
Publication deadline: 2023-09-30

Context:
The RIMO project aims to build the first FAIR, conceptual, terminological and interdisciplinary reference framework for memory. Searchable through an open-access platform, it will allow different levels of abstraction and querying. Users (the general public, practitioners or memory science specialists) will be able, depending on their expertise, to explore memory concepts from different viewpoints. These viewpoints may be general understandings of what is meant by “memory” as well as specific theories and tools.

Subject:
The objectives are to:
(1) build an annotated scientific text corpus from texts drawn from the sub-disciplines of memory science within the project's scope, relying on a research network (the GDR/Réseau Thématique Mémoire) and drawing on a cognitive theory of annotation processes;
(2) apply, on interdisciplinary corpora, recent advances in artificial intelligence, including machine learning and representation learning, to extract terms and the relations between them. Natural language processing and knowledge extraction techniques using neuro-symbolic approaches will also be used;
(3) build a conceptual model of the memory domain within an ontology that must account for different viewpoints and levels of abstraction.

Candidate profile:
The proposed thesis project is aimed primarily at a holder of a Master's degree in computer science or computational linguistics who is interested in the topics developed by IRIT and wishes to build expertise in them (see above), while incorporating into their work expertise in psychology on dual-process models of memory (recollection, familiarity) and the associated representations (detailed representation, thematic representation).

The project could also suit a holder of a Master's degree in psychology who is already interested in tools for studying memory retrieval processes and the associated representations, and who would also like to develop skills in computer science (knowledge extraction from texts, ontology construction, machine learning).

Education and required skills:
(See profile)

Work address:
Toulouse

Joint PhD offer (Perth, Australia; please circulate) – Note: the application deadline is very close

Related Action/Network: –/Doctorants

Laboratory/Company: University of Murdoch (Australia) – Tetis (Montpel
Duration: 36
Contact: maguelonne.teisseire@teledetection.fr
Publication deadline: 2023-09-30

Context:
Joint PhD position
Data-Driven Methods for Modeling the 3D Structure of Plants

Subject:
The aim of this PhD thesis is to develop data-driven techniques for modelling the 3D structure of plants and analyze how plant structure is affected by various intrinsic and extrinsic factors such as soil conditions and environmental factors. This is an important problem that has a wide range of applications in plant biology and agriculture. One of the main scientific challenges is to develop efficient algorithms for the extraction of features and patterns from 3D point clouds representing plant shape. Another challenge is to develop models that can simulate the growth and development of plant structures over time, taking into account various environmental factors. Another scientific question addressed in this project is how to analyze the complex relationships between plant structure and function at different scales. This involves the development of methods to measure and quantify plant traits such as biomass, leaf area, and stomatal density, and to relate these traits to plant function and performance. Overall, the project aims to advance our understanding of the structure-function relationships in plants and to provide new tools for plant breeders, ecologists, and agronomists to improve crop productivity and resilience in the face of environmental challenges.
Keywords: Deep Learning, 3D computer vision, shape analysis, geometric modelling.

Candidate profile:
Qualification: The successful candidate is expected to have an MSc degree (or equivalent) with a significant research component, completed by September 2023, and a background in image processing, computer vision, computer graphics, machine learning applied to vision, or 3D geometry processing. Students with a background in mathematics, especially 3D geometry, are strongly encouraged to apply.

Education and required skills:

Experience: The ideal candidate should have some knowledge of and experience in at least one of the fields listed above. The successful candidate should have strong programming skills.

As for generic competences, we seek a qualified, self-motivated professional who is open to multidisciplinary work, able to undertake independent research, and able to work in a team.

Language skills: Fluent written and verbal communication skills in English are required.

Work address:
The candidate should also be willing to spend 18 months in Australia and 18 months in France.

Attached document: 202307101258_TETIS_Murdoch_Joint_PhD_position_2023.docx – Google Docs.pdf

Scientific half-day: explainability of machine learning models for 2D/3D signals

Date: 2023-07-13
Location: Amphithéâtre Laura Bassi, INSA, Campus de la Doua, Lyon

Or online:
https://cnrs.zoom.us/j/91931962448?pwd=eC9xTkNaeTlkN2Npd29zd2t4YnJNdz09

The “Images et Informatique Graphique” and “Intelligence Artificielle et Apprentissage Automatique” (I3A) themes of the Fédération Informatique de Lyon (FIL) are organizing a half-day of scientific exchange on the explainability of machine learning models for 2D/3D signals.

Programme:
Céline Hudelot (Professor at CentraleSupélec Paris, MICS laboratory)
“Visual explanation techniques for understanding the decisions of classification models in medical imaging”

Sylvia Tulli (Maître de Conférences, Sorbonne Université, Paris, ISIR laboratory)
“Explainable AI for Sequential Decision Making and Robotics”

Valentine Wargnier Dauchelle (PhD student, Creatis laboratory, Lyon)
“A Weakly Supervised Gradient Attribution Constraint for Interpretable Classification and Anomaly Detection”

Martin Blanchard (PhD student, LHC, Saint-Étienne)
“Explainable neural networks: application to cell imaging with ProtoPNet”

Sébastien Valette (Chargé de Recherche, Creatis laboratory, Lyon)
“Disentangled representations: towards interpretation of sex determination from hip bone”

Registration: https://evento.renater.fr/survey/demi-journee-fil-xai-hyu9etrx



Summer school “Graphs and their applications to data processing”

Date: 2024-06-24 to 2024-07-01
Location: INSA de Rouen (76)

Graphs are at the heart of many research topics, or arise spontaneously in projects where they were not expected. Processing these objects can draw on several research areas, such as graph theory, the modelling of semantic or dynamic data, the processing of data defined on graphs, or learning from data defined on or by graphs. This apparent disparity of approaches nevertheless conceals the use of many common tools (e.g. maximum common subgraph, graph patterns, graph metrics...). The aim of the school is to give an overview of these approaches to data processing on graphs and to enable fruitful exchanges between scientific communities.


Open PhD position on NLP at Sorbonne University jointly with ISIR (Paris, France) and IRL – ILLS (Montreal, Canada)

Related Action/Network: SimpleText/Doctorants

Laboratory/Company: ISIR (Paris, France) and IRL – ILLS (Montreal, Ca
Duration: 3 years
Contact: pablo.piantanida@centralesupelec.fr
Publication deadline: 2023-08-31

Context:
Large Language Models (LLMs), also known as foundation models, have greatly improved the fluency and diversity of machine-generated text. Indeed, the release of ChatGPT and GPT-4 by OpenAI has sparked global discussions on the effective use of AI-based writing assistants. However, this progress has also introduced considerable threats such as fake news and the potential for harmful outputs such as toxic or dishonest speech, among others. Research on methods for detecting the origin of a given text, in order to mitigate the dissemination of forged content and to prevent technology-aided plagiarism, appears to lag behind the rapid advancement of AI itself. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. Likewise, for tasks like machine translation, it becomes important to detect hallucinations or omissions, i.e., translations that either contain information completely unrelated to the input or that do not include some of the input information.
Recent works have indeed focused on tools that are able to spot such AI-generated outputs to identify and address these underlying risks. However, many of the existing approaches rely on pre-existing classifiers for specific undesired outputs, which restricts their applicability to situations where the harmful behavior is precisely known in advance.
Statistical analysis of lexical distributions is a valuable approach for anomaly detection in natural texts. By examining the frequency distributions of words and phrases in a given text or dataset, statistical methods can help identify unusual or anomalous patterns that deviate from the norm. These anomalies may indicate potentially harmful outputs, reveal the origin of a given text, or expose hallucinations, stylistic inconsistencies, or even malicious intent in the text. By leveraging statistical methods to analyze lexical distributions, this thesis will focus on automatically uncovering deviations and anomalies that may indicate irregularities or unexpected patterns in natural language texts.
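
As a purely illustrative sketch of what comparing lexical distributions can look like (this is not the method the thesis will develop), the Python snippet below builds relative word frequencies for two texts and scores how far they diverge using the Jensen-Shannon divergence. The function names `word_distribution` and `js_divergence`, the simple tokenizer and the toy sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def word_distribution(text):
    """Return the relative frequency of each lowercase word in `text`."""
    # Hypothetical, deliberately naive tokenizer: runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    The value lies in [0, 1]: 0 for identical distributions, 1 when the
    two texts share no vocabulary at all.
    """
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            div += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            div += 0.5 * qw * math.log2(qw / mw)
    return div


# Toy data (assumed for illustration only).
reference = word_distribution("the cat sat on the mat and the dog sat on the rug")
same = word_distribution("the cat sat on the mat and the dog sat on the rug")
other = word_distribution("quantum flux capacitors destabilise orbital manifolds")

score_same = js_divergence(reference, same)    # identical texts
score_other = js_divergence(reference, other)  # disjoint vocabularies
```

A text whose score against a reference corpus is unusually high could then be flagged for closer inspection; real documents would need far more careful tokenization, smoothing for rare words, and a calibrated threshold.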

Subject:
Forged texts and misinformation are ongoing issues and are present all around us in biased software that amplifies only our own opinions for a “better”, more seamless user experience. On social media platforms, such software is used by rogue states, businesses, and individuals to create misinformation, amplify doubts about factual data or tarnish their competitors or adversaries, thereby enhancing their own strategic or economic positions. This spread may be the result of different factors and incentives; however, each poses the same fundamental issue to humanity: the misunderstanding of what is true and what is false.

Deep learning models for large-scale text generation such as GPT-3 and GPT-4 have seen widespread use in recent years due to their superior performance over traditional generation methods, demonstrating an ability to produce texts of great quality, with a coherence and relevance that are sometimes hard to distinguish from human productions. These models generate text via an auto-regressive procedure that samples from a distribution learned to mimic the “true” distribution of human-written texts. Malicious uses of these technologies thus constitute a major threat to truthful information.

Artificial text detection can be viewed as a special case of anomaly detection, broadly defined as the task of identifying examples that deviate from regular ones to a degree that arouses suspicion. Current research in anomaly detection largely focuses either on deep classifiers (e.g., out-of-distribution detection, adversarial attack) or relies on the output of large language models when labeled data is unavailable. Although these lines of research are appealing, they do not scale without requiring a large amount of computing. Additionally, these methods make the fundamental assumptions that (1) the statistical information needed to identify anomalies is available in the trained model, and (2) the model uncertainty can be trusted, which is typically not the case, as illustrated in the presence of a small shift in the input distribution. LLM-based approaches do not perform well when used on large text fragments, as may be needed in practical applications (e.g., novel, story, or news generation), because of the fixed-length context used when training the language model.

This Ph.D. thesis focuses on developing hybrid anomaly detection methods using deep neural network-based techniques and linguistically inspired word frequency distributions. Most of the research on language models to date focuses on sentence-level processing and fails to capture long-range dependencies at the discourse level. Instead, we will leverage word frequency distributions and information measures to characterize long documents, which involve a very large number of rare words; this often leads to strange statistical phenomena, such as mean frequencies that keep changing systematically as the number of observations increases. Advanced concepts from statistics and information measures are necessary to analyze word frequency distributions and to capture document-level information. The thesis is expected to design and develop novel statistical models and algorithms specifically tailored for analyzing lexical distributions in natural texts. Extensive experiments on real-world data sets will be carried out to showcase the viability of our approach, benchmark its performance, and analyze its advantages, limitations, and areas for improvement.

*Research questions*
Some potential research questions for our consideration are:

• How can lexical distributions be effectively modeled and represented in natural language texts?

• What information (statistical) measures and techniques can be derived to identify anomalies in lexical distributions?

• How can contextual information and linguistic features be integrated into anomaly detection models based on lexical distributions?

• Can unsupervised learning techniques be leveraged to detect anomalies without the need for labeled anomaly data?

• How can domain-specific knowledge and expert (or mechanical) feedback be incorporated into the anomaly detection process to improve performance?

This research will provide a deeper understanding of statistical analysis techniques for anomaly detection in natural texts and contribute to the development of more accurate and reliable methods for identifying unusual patterns in language usage.

*Team supervision*
Institut des Systèmes Intelligents et de Robotique (ISIR) and the International Laboratory on Learning Systems (ILLS) are looking for a student with a background in AI and Data Science, who gets inspired by sciences and the opportunities of data and AI to solve complex NLP problems. You have strong programming skills and a very good understanding of data science, statistics, and Machine Learning.

*An international and stimulating environment for research*
ILLS will promote international mobility between France and Canada to facilitate collaborations with Ph.D. students and professors in Canada. The partners in Canada are McGill University, École de Technologie Supérieure (ÉTS), and the Quebec Artificial Intelligence Institute (Mila), which are major international players in AI. They are involved in many research, industrial and academic projects. François Yvon, who will supervise this thesis at ISIR (Sorbonne Université), is a senior researcher at CNRS and a recognized expert in automatic language processing, machine translation, speech recognition, statistical language modeling, document mining, and learning by analogy. Prof. Pablo Piantanida, who will supervise this thesis on the ILLS (McGill – ETS – Mila) side, is a recognized expert in information theory and machine learning. A key strength of the partnership is the high international standing of the recently created International Research Laboratory ILLS of the CNRS in Montreal, which provides a highly dynamic and rich research environment in AI at large.

Candidate profile:
• Very good understanding of Machine Learning theory and techniques.
• Good programming skills in Python (PyTorch).
• Application or domain knowledge in natural language processing is a plus.
• Good communication skills in written and spoken English.
• Creativity and ability to formulate problems and solve them independently.

Education and required skills:
• MSc in Computer Science, Machine Learning, Computer Engineering, Mathematics, or a related field (e.g. applied mathematics/statistics).

Work address:
https://emploi.cnrs.fr/Offres/Doctorant/UMR7222-YVEGER-002/Default.aspx?lang=EN

Attached document: 202307091401_PhD_Topic_CNRS_ISIR_ILLS.pdf