Prototyping a Python library for information extraction

Offer linked to Action/Network: – / –

Laboratory/Company: MaIAGE unit, INRAE, Université Paris-Saclay
Duration: 4-6 months
Contact: arnaud.ferre@inrae.fr
Publication deadline: 2023-03-06

Context:
Information extraction is the area of Natural Language Processing that aims to automatically extract and structure information contained in large volumes of text. An extraction pipeline typically starts with an entity recognition task, which may be followed by an entity normalization task (sometimes called “entity linking/disambiguation” or “concept normalization”) and/or by a relation extraction task.

The Bibliome team of the MaIAGE research unit at INRAE/Université Paris-Saclay specializes in methodological research on information extraction, particularly in specialized domains. It also develops extraction solutions for targeted applications in the life sciences.

Supervisors: Arnaud Ferré and Louise Deléger

Topic:
Today, the vast majority of extraction methods are implemented in Python. Although some standard libraries for natural language processing that ship their own data structures are beginning to emerge (e.g. Stanza [1] or spaCy [2]), they often do not adequately represent the objects specifically handled in information extraction. For example, they have no explicit classes named “mention” or “concept”, which are basic notions in entity normalization. And although a more abstract class exists that can represent a mention, it cannot be defined as discontinuous (e.g. the noun phrase “liver and pancreatic cancer” contains two distinct mentions, including the mention of interest “liver cancer”, which cannot be represented because it is discontinuous). As a result, most researchers developing new methods still rely on ad hoc structures tailored to their tasks, which are hard to share and even raise reproducibility concerns.
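To make the discontinuous-mention example concrete, here is a minimal sketch of how such an object could be represented in Python. The class name `Mention` and its fields are illustrative assumptions; they are not the design of the library to be prototyped, nor an existing Stanza or spaCy API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention made of one or more character spans.

    Each span is a (start, end) pair of character offsets into `text`,
    with `end` exclusive. Several spans make the mention discontinuous.
    """

    text: str
    spans: Tuple[Tuple[int, int], ...]

    def __post_init__(self) -> None:
        previous_end = 0
        for start, end in self.spans:
            # Spans must be non-empty, inside the text, ordered and non-overlapping.
            if not (previous_end <= start < end <= len(self.text)):
                raise ValueError(f"invalid span ({start}, {end})")
            previous_end = end

    @property
    def surface(self) -> str:
        """Surface form obtained by joining the fragments with a space."""
        return " ".join(self.text[start:end] for start, end in self.spans)

    @property
    def is_discontinuous(self) -> bool:
        return len(self.spans) > 1


sentence = "liver and pancreatic cancer"
liver_cancer = Mention(sentence, ((0, 5), (21, 27)))    # "liver" + "cancer"
pancreatic_cancer = Mention(sentence, ((10, 27),))      # one contiguous span

print(liver_cancer.surface, liver_cancer.is_discontinuous)
print(pancreatic_cancer.surface, pancreatic_cancer.is_discontinuous)
```

Storing a tuple of spans rather than a single (start, end) pair is the only change needed to cover both cases; a contiguous mention is simply a mention with one span.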

We hypothesize that a standard library defining a more specific data structure, i.e. one closer to the needs of information extraction methodologists, would improve reproducibility, make methods easier to pick up, and save development and integration time.

The intern will develop a prototype Python library defining object classes suited to the needs of methodologists for entity recognition and normalization tasks. A first comparison with at least one of the standard libraries will have to be carried out. If relevant, the library may be developed as an extension of one of these standard libraries. Recognition and normalization methods and evaluation datasets will be provided to set up an experimental development framework. This work will involve developing parsers that traverse, analyze and extract the elements of dataset files (in various formats) and instantiate them in a program using the structures of the developed library. In a second phase, this work may be extended to relation extraction.
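As an illustration of the parser work described above, the sketch below reads text-bound annotations from a brat-style standoff file, where discontinuous spans are written as `start end;start end`. The posting does not name any dataset format, so the choice of brat standoff, the `Annotation` structure and the sample lines (including the placeholder identifier `EX:0001`) are assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    """Hypothetical in-memory structure for one text-bound annotation."""

    ann_id: str
    label: str
    spans: Tuple[Tuple[int, int], ...]
    surface: str


def parse_standoff(lines: Iterable[str]) -> List[Annotation]:
    """Parse text-bound ("T") lines of a brat-style standoff file.

    Expected line shape: ID <tab> LABEL START END[;START END...] <tab> TEXT
    Other annotation types (relations, normalizations, notes) are skipped.
    """
    annotations = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("T"):
            continue
        ann_id, header, surface = line.split("\t", 2)
        label, offsets = header.split(" ", 1)
        spans = tuple(
            (int(start), int(end))
            for start, end in (fragment.split(" ") for fragment in offsets.split(";"))
        )
        annotations.append(Annotation(ann_id, label, spans, surface))
    return annotations


# Toy annotations for the text "liver and pancreatic cancer".
sample = [
    "T1\tDisease 0 5;21 27\tliver cancer",
    "T2\tDisease 10 27\tpancreatic cancer",
    "N1\tReference T1 EX:0001\tplaceholder concept",  # skipped by this sketch
]
parsed = parse_standoff(sample)
for annotation in parsed:
    print(annotation)
```

A real parser would also need to handle the normalization lines and the other dataset formats; the point here is only that each file format maps onto the same library structures.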

The intern will have access to a desktop computer, the laboratory’s compute servers and, if needed, high-performance computing infrastructure (e.g. Lab-IA).

[1] Qi, Peng, et al. “Stanza: A Python Natural Language Processing Toolkit for Many Human Languages.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2020.
[2] Honnibal, Matthew, and Ines Montani. “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.” To appear 7.1 (2017): 411-420.

Candidate profile:
Trained or experienced in natural language processing, or more specifically in information extraction.

Autonomous in Python programming, particularly object-oriented programming.

Required education and skills:
Second year of a Master’s degree (M2) or final year of engineering school in computer science, linguistics or NLP. Open to other specializations (e.g. bioinformatics) depending on experience.

Work address:
INRAE research center, Jouy-en-Josas (78)

Towards an explainable model for fake news detection on medical data based on deep learning methods

Offer linked to Action/Network: – / –

Laboratory/Company: ICube laboratory, University of Strasbourg
Duration: 5-6 months
Contact: stella@unistra.fr
Publication deadline: 2023-03-06

Context:
This internship is part of the DEEPISH project (Deep lEarning ExPlainabilIty through Symbolic approacHes), carried out within the SDC (Science des Données et Connaissances) and CSTB (Systèmes Complexes et Bioinformatique Translationnelle) teams of the ICube laboratory. The project aims to propose a general model, based on symbolic reasoning techniques, for explaining the decisions of systems built on deep learning.

Topic:
The internship consists in proposing a method for detecting misleading information (“fake news”) in medical data collected on the internet. Detection will rely on text classification methods based on language models pre-trained on large amounts of text, i.e. BERT-type “transformer” models. Detection must be accompanied by an explainability model based on a conceptualization of the extracted data.

Candidate profile:
Autonomous, curious, with a taste for concept modeling and for implementing deep learning techniques.
Good communication skills and ability to exchange ideas.

Required education and skills:
In the second year of a Master’s program or at an equivalent level in an engineering school, the candidate must have followed a computer science track oriented towards data science or artificial intelligence. He or she must have a good command of:
– the basic mechanisms of deep learning (TensorFlow, Keras libraries, etc.),
– the Python language,
– natural language processing (NLP) methods,
– symbolic reasoning and knowledge modeling (logical rules, ontologies, etc.).

Work address:
ICube UMR 7357 – Laboratoire des sciences de l’ingénieur, de l’informatique et de l’imagerie
300 bd Sébastien Brant – CS 10413 – F-67412 Illkirch Cedex

Attached document: 202212051544_Sujet DEEPISH M2 2023.pdf

Explainable deep learning for Mild Cognitive Impairment detection with MR spectroscopy data

Offer linked to Action/Network: – / Innovation

Laboratory/Company: XLIM, University of Poitiers
Duration: 5-6 months
Contact: olfa.ben.ahmed@univ-poitiers.fr
Publication deadline: 2023-03-06

Context:
Alzheimer’s Disease (AD) is the most common form of dementia. Neuroimaging data is an integral part of the clinical assessment, giving clinicians a way to detect brain abnormalities for AD diagnosis. Patients with AD suffer from neuronal and synaptic loss and from the resulting cognitive decline (e.g. memory loss, difficulty with problem-solving). Although there is currently no cure for AD, available medications can slow disease progression and improve the patient’s quality of life. Recent biomarker studies suggest that AD pathology starts long before clinical symptoms appear, and even before brain damage. Hence, diagnosing AD at earlier stages is of great clinical importance, so that medication can improve cognitive function and slow the spread of the disease. Mild Cognitive Impairment (MCI) is an intermediate stage between healthy aging and AD.
Detecting MCI subjects provides a potential window for early AD detection. However, detecting MCI subjects remains a challenging clinical problem, as MCI lies on a spectrum between NC and manifest AD. Identifying efficient biomarkers for early AD stages therefore helps establish diagnosis and treatment strategies without delay. Over the last decades, imaging biomarkers derived from structural MRI, combined with machine learning techniques, have been widely studied to assess brain atrophy for AD detection and prediction [1]. In addition to structural changes, metabolic changes in some brain regions could be a good biomarker for early AD detection [2]. However, structural brain atrophy is not detectable at an early stage of the disease, namely for Mild Cognitive Impairment (MCI) and Mild Alzheimer’s Disease (MAD). Biological biomarkers, in contrast, have shown their ability to detect AD-related brain abnormalities before structural brain damage and clinical manifestation. Magnetic Resonance Spectroscopy (MRS) is a non-invasive technique that gives complementary, in vivo access to brain metabolism during a conventional MRI examination. MRS provides biological information about brain tissue at the molecular level, making it possible to detect brain abnormalities while MRI still appears normal.

Topic:
The goal of this internship is to:
• develop new deep-learning-based models for spectroscopy data classification for early AD detection, namely detection of the MCI class.
• propose and implement a method for generating 1D Class Activation Maps (CAM) on the 1D spectroscopy data for model interpretation. This task will build on recent work in our team [3]. The resulting 1D CAM should highlight the contributions of the different MRS metabolites to the classification task.
Data used in this internship are provided by the CHU of Poitiers. In addition to the MRS data, this data set contains multi-modal data from subjects at different stages of AD (healthy elderly subjects, Mild Cognitive Impairment (MCI) and AD subjects).
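As a rough illustration of the second goal, the sketch below computes a 1D class activation map in the classic CAM setting: a convolutional network whose last feature maps are globally average-pooled and fed to a linear classifier, so that the map for a class is the weighted sum of the feature maps. This is an assumed, generic formulation; the method from [3] is not described in this posting and may differ. The function name and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def class_activation_map_1d(
    feature_maps: Sequence[Sequence[float]],
    class_weights: Sequence[float],
) -> List[float]:
    """Weighted sum of K feature maps of length T, rescaled to [0, 1].

    feature_maps[k][t] is the activation of channel k at position t of the
    last convolutional layer; class_weights[k] is the weight linking the
    pooled channel k to the target class in the final linear layer.
    """
    if len(feature_maps) != len(class_weights):
        raise ValueError("one weight is required per feature map")
    length = len(feature_maps[0])
    if any(len(channel) != length for channel in feature_maps):
        raise ValueError("all feature maps must have the same length")

    cam = [0.0] * length
    for channel, weight in zip(feature_maps, class_weights):
        for position, activation in enumerate(channel):
            cam[position] += weight * activation

    # Min-max rescaling, only to make maps comparable when plotted.
    low, high = min(cam), max(cam)
    if high == low:
        return [0.0] * length
    return [(value - low) / (high - low) for value in cam]


# Toy example: 2 channels over 4 spectral positions.
toy_maps = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
toy_weights = [0.5, 1.0]
cam = class_activation_map_1d(toy_maps, toy_weights)
print(cam)
```

With spectroscopy input, each position of the map corresponds to a region of the spectrum, so peaks in the map could be read against the metabolite resonances located there.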
Possibility to continue with a PhD (starting in September/October 2023) in artificial intelligence for medical image analysis.

Location: XLIM (Futuroscope site), University of Poitiers, in collaboration with the CHU of Poitiers
Tentative start date: February/March 2023

Candidate profile:
• Master 2 in computer vision, image processing, machine learning or any related field

Application: send CV, transcripts and 2 reference letters to olfa.ben.ahmed@univ-poitiers.fr

Required education and skills:
• Strong programming skills in Python and deep learning frameworks (TensorFlow, PyTorch)

Work address:
XLIM (Futuroscope site), University of Poitiers, in collaboration with the CHU of Poitiers

Heterogeneous data analysis to improve the prediction of food security indices

Offer linked to Action/Network: – / –

Laboratory/Company: CIRAD – UMR TETIS
Duration: 6 months
Contact: simon.madec@cirad.fr
Publication deadline: 2023-03-06

Context:
This Master’s internship is part of the SCOSSA project of the CNES TOSCA program and addresses the general topic of food security in West Africa, considered one of the major development challenges of the region.

In this context, data collected through household surveys are today a fundamental source of information for computing the food security indicators that various organizations then use routinely. These indicators are particularly difficult to produce in conflict zones, where surveys cannot be conducted normally.

Recent studies have looked at estimating these indices from heterogeneous geospatial data, proposing methods based on advanced data science techniques, more specifically machine learning and deep learning [1]. These approaches can explain part of the variation in insufficient food consumption and can outperform a model that uses prevalence as its estimate.

Limitations remain, notably regarding the explainability of the machine learning models and their performance and validation in unprecedented situations: regions affected by armed conflict, periods of economic crisis or inflation.

Topic:
Within UMR TETIS and in connection with the MISCA and ATTOS teams, the goal of this internship is to improve the performance of the learning models used to estimate food consumption indicators.

A first task will be to collect and link heterogeneous economic data and data related to conflict situations in the regions of interest [2].

A second step will be to analyze simulation results for different inputs (static and non-static data; agronomic, meteorological and economic data, etc.).

Tests will also be carried out on other regions and with data from new surveys [3].

[1] Deléglise, Hugo, et al. “Food security prediction from heterogeneous data combining machine and deep learning methods.” Expert Systems with Applications 190 (2022): 116189.

[2] Andree, Bo Pieter Johannes. “Estimating Food Price Inflation from Partial Surveys.” World Bank, Washington, DC (2021).

[3] https://microdata.worldbank.org/index.php/catalog/3768#metadata-version

Candidate profile:
Candidate skills:

Knowledge of and taste for programming

Interest in data analysis

Scientific rigor

Curiosity and open-mindedness

Analytical, writing and synthesis skills

Additional information:

Duration of 6 months, starting in February 2023

The internship will take place at CIRAD, in UMR TETIS (Territoire, Environnement, Télédétection et Information Spatiale), located at the Maison de la Télédétection in Montpellier.

Supervision

Simon Madec / Roberto Interdonato

Send a CV and a cover letter before 31/12/2022 to: simon.madec@cirad.fr

Work address:
Maison de la Télédétection, 500 Rue Jean François Breton, 34090 Montpellier

Attached document: 202212051459_Document.pdf

1st call for papers IC @ PFIA 2023

Date: 2022-12-05 => 2023-07-07
Location: Strasbourg, France.

Call for papers, IC 2023 (34th Journées Francophones d’Ingénierie des Connaissances)
as part of the PFIA 2023 platform (Plate-Forme de l’Intelligence Artificielle)
3-7 July 2023, Strasbourg, France.

----------------------------------------
Conference overview
----------------------------------------
The French-speaking Knowledge Engineering days (Ingénierie des Connaissances, IC) have been held every year since 1997, first under the aegis of Gracq (Groupe de Recherche en Acquisition des Connaissances) and then under that of the SIC college (Science de l’Ingénierie des Connaissances) of AFIA. This year again, IC is hosted by the PFIA platform, alongside other French-speaking conferences in the field of artificial intelligence (AI).

Knowledge engineering can be seen as the area of artificial intelligence that accompanies the evolution of information and communication sciences and technologies, which are transforming individual and collective practices. It aims to contribute to this growth by developing models, methods and tools for acquiring, representing and integrating knowledge, so that it can be exploited in computing environments with varied characteristics. The formal representation of this knowledge enables automated reasoning over it and over the associated data, which may be complex, heterogeneous and evolving. Its ultimate goal is to produce systems capable of assisting humans in their activities and decision-making.

The Knowledge Engineering conference brings together the French-speaking community and is a place for exchange and reflection, and for presenting and comparing theories, practices, methods and tools. This community must now take into account the rise of learning algorithms and their impact on individual and collective practices, while keeping humans at the center of data and knowledge systems.

----------------------------------------
Conference topics
----------------------------------------
Submissions on the theme “contributions of knowledge graphs to neuro-symbolic machine learning approaches in knowledge engineering” are particularly welcome. We also encourage submissions on work, whether original or already published internationally, of theoretical, methodological or practical scope, on any of the topics listed below (non-exhaustive list):

Knowledge engineering for the Web
Storage and querying of distributed knowledge
Semantic Web, Web of Data, Social Web, Web of Things
Knowledge representation, ontologies
Knowledge models: design, evolution, evaluation, exploitation, life cycle
Modeling and formalization: formal and informal languages, standardization
Methods and tools for ontology engineering: alignment, integration, modularity, merging, metrics, design patterns, visualization
Design and reuse of foundational ontologies, core-domain ontologies, domain ontologies, interoperability, terminologies

From data to knowledge
Knowledge extraction and acquisition, ontology population, semantic annotation
Knowledge acquisition from text, images, unstructured data and interactions
Engineering of collaborative systems, crowdsourcing
Knowledge processing and reasoning
Knowledge engineering and data mining

Data and knowledge quality
Knowledge engineering and complex data: multimedia, multilingual, temporal, spatial, multi-scale, imprecise or uncertain data
Ownership and security in knowledge-based systems
Provenance and trust in data, truth detection, uncertainty
Metrics and evaluation of data and knowledge quality

Reasoning and learning
Inference and business rules
Logical reasoning, approximation, statistical reasoning, analogical reasoning, case-based reasoning, reasoning in non-classical logics
Computation of knowledge graph embeddings
Deep learning and knowledge graphs

Applications of knowledge engineering and experience reports
Information retrieval, indexing, recommendation
Human-computer interaction: visualization of data, knowledge and interconnections, interfaces to knowledge-based systems, explanations
Conversational agents
Knowledge-based recommender systems
Adaptation, personalization: user profiles, context models and adaptation, emotion models
Processing of massive, heterogeneous data
Applications to the life sciences, agriculture, culture, education, industry, economics, law, business intelligence (BI), etc.

----------------------------------------
Important dates
----------------------------------------
Paper submission: 1 March 2023
Author notification: 15 April 2023
Camera-ready versions due: 15 May 2023
Conference dates: 3-7 July 2023

----------------------------------------
Submissions
----------------------------------------
The call for contributions for the 2023 edition of the IC conference includes several types of papers:

Original research papers (academic or applied/industrial)
– Long papers presenting original, validated work (at most 10 pages including references; 20 min oral presentation, 10 min discussion)
– Short papers presenting original work with preliminary results (at most 6 pages including references; 15 min oral presentation, 5 min discussion)
– Posters and demonstrations accompanied by abstracts of at most 4 pages including references (presented during the platform’s poster/demo sessions). For demonstrations, it is recommended to include in the abstract a link to a video demonstrating the tool or software.

Position papers
– Position papers offering a retrospective on work in a well-identified area related to the conference topics, and proposing a point of view on the next major scientific challenges in that area (at most 6 pages including references; 15 min oral presentation, 10 min discussion)

Previously published research papers
– Papers already published in international conferences or journals but not yet available in French. The submission must be in French and include a reference to the published paper (at most 2 pages including references).

A best paper award will be given by the program committee during the conference.

----------------------------------------
Committees
----------------------------------------
Program committee chair: Cassia Trojahn (Université Toulouse 2 Jean Jaurès, IRIT)
Program committee: being formed
Organizing committee chair: Thomas Guyet (INRIA)

Direct link

Our website: www.madics.fr
Follow us on Twitter: @GDR_MADICS
To unsubscribe from the list, follow this link.

PhD Defense: Explainable Classification of Uncertain Time Series

Date: 2022-12-13
Location: ISIMA, Salle du conseil (A102) and by videoconference

Hello,

I hope you are doing well.

I have the great pleasure of inviting you to my PhD defense, entitled Explainable Classification of Uncertain Time Series. The defense will take place on the 13th of December 2022 at 2 pm in room A102 (Salle du Conseil) at ISIMA. You are also invited to share some drinks and candies in room A104 right after the defense.

How to attend remotely?
There will be two channels to attend the defense remotely:
– By Microsoft Teams using this link: https://teams.microsoft.com/l/meetup-join/19%3ameeting_YWNmMDQ1MDAtYWFlOC00MDNjLWE3NTMtNjY5ODkxOTVhMDFm%40thread.v2/0?context=%7b%22Tid%22%3a%225a16bd04-b475-49ff-b11a-c6c8359db1b1%22%2c%22Oid%22%3a%22949eb4b9-6120-456f-95a8-6ec37948db76%22%7d
– By YouTube using this link: https://youtu.be/EW1Wp3Fg-1Q. Feel free to leave a thumbs up if you like the presentation and a thumbs down if you do not. I will also be happy to read any comments you may have about the presentation.

Here is the abstract of the presentation: Time series classification is one of the most studied theoretical and applied fields of time series analysis. Many classical machine learning and deep learning algorithms have been developed during the last decade to perform time series classification accurately. However, the case where the time series are uncertain is still under-explored. In this work, we discuss the importance of uncertainty handling in machine learning in general and in time series classification in particular. We propose efficient, robust and explainable methods for the classification of uncertain time series. We assess our methods on simulated datasets, but also on a real scenario in astrophysics in which uncertainty is preponderant. The results we obtained can be understood and trusted by astronomers. Our proposed methods are tools that will facilitate the understanding of the universe in which we live in particular, and the field of uncertain time series classification in general.

Here is the composition of the Jury:
Anthony BAGNALL (R) – University of East Anglia
Sebastien DESTERCKE (R) – Heudiasyc, University of Technology of Compiegne
Elisa FROMONT (E) – IRISA, University of Rennes 1
Emmanuel GANGLER (E) – LPC, University Clermont Auvergne
David HILL (E) – LIMOS, University Clermont Auvergne
Themis PALPANAS (E) – LIPADE, Universite Paris Cite
Engelbert MEPHU NGUIFO (A) – LIMOS, University Clermont Auvergne
(R): Reviewer, (E): Examiner, (A): Advisor

I am looking forward to defending my work in front of you.

Best regards

ADBIS 2023 – call for tutorial proposals

Date: 2022-12-13
Location: Barcelona, Spain

Call for tutorials

ADBIS 2023 invites submissions for tutorial proposals on all topics of potential interest to the conference attendees. Tutorial proposals should cover state-of-the-art research, development, and applications in specific data management or information systems related areas, and stimulate and facilitate future work. Proposals must provide an in-depth survey of the selected topic with the option of describing specific works in depth.

The topic of the tutorial should be broad enough to attract a significant audience, and the proposal must include enough detail to convey both the scope of the material to be covered and the depth to which it will be covered. Tutorials on interdisciplinary directions, on bridging scientific research and applied communities, on novel and fast-growing directions, and on significant applications, as well as hands-on tutorials, are highly encouraged.

Important Dates

All deadlines below are AOE

Submission deadline: April 20, 2023
Notification: May 15, 2023
Camera-ready abstract overview due: June 15, 2023
Slides availability: September 3, 2023

Submission Guidelines

Tutorial submissions must be submitted electronically, in pdf format, to each of the tutorial chairs.

Submissions should be formatted using the LNCS style templates, with a maximum length of 8 pages, inclusive of ALL material. Any submitted paper violating the length, file type, or formatting requirements will be desk rejected.

Tutorials will be selected based on technical quality, significance of the topics, and relevance to ADBIS.
Originality will be considered a plus. Accepted tutorials will be considered for publication in the conference or workshop proceedings.

Proposals should include:

Title of the tutorial
Names, affiliations and email addresses of the presenters
Overview of tutorial, with justification of its relevance and timeliness
Target audience and assumed background
Related recent tutorials and how the proposed tutorial is different or novel compared to those
Scope and structure: enough detail to provide a sense of both the scope of material to be covered and the depth to which it will be covered
Brief professional biographies of presenters, with a note on their background in the area of the tutorial

Authors of accepted tutorials are encouraged to provide their own recording of the tutorial, for dissemination purposes via the conference website. In any case, the presenters are expected to attend the live event and give the tutorial in person, not just play a pre-recorded video.

Tutorial Chairs
Patrick Marcel
Boris Novikov

FL-Day – Decentralized Federated Learning: Approaches and Challenges

Date : 2022-12-13
Lieu : Université Paris-Saclay
Amphithéâtre du bâtiment Digiteo (LISN)
Campus Universitaire, Rue Raimond Castaing bâtiment 650
91190 Gif-sur-Yvette

L’équipe ADAM du Laboratoire DAVID et l’Institut DATAIA co-organisent un workshop sur la thématique du Federated Learning, qui aura lieu à l’Amphithéâtre du bâtiment Digiteo (LISN) le mardi 10 janvier 2023 (Maps).

Inscription obligatoire & gratuite (dans la limite des places disponibles)
lien ici: https://www.dataia.eu/evenements/workshop-fl-day-decentralized-federated-learning-approaches-and-challenges

La journée abordera à travers plusieurs présentations, les problématiques liées à la thématique « Decentralized Federated Learning », de l’apprentissage automatique, au traitement de données décentralisées (Edge Computing) ou encore de la protection des données « privacy » dans un contexte décentralisé avec des illustrations dans différents domaines. Les présentations seront suivies d’une table ronde.

Les participants qui le souhaitent sont invités à proposer des Posters pour exposer leurs travaux pendant les pauses, en l’envoyant aux organisateurs ci-dessous :
Karine ZEITOUNI – karine.zeitouni@uvsq.fr
Zaineb CHELLY – zaineb.chelly-dagdia@uvsq.fr
Mustapha LEBBAH – mustapha.lebbah@uvsq.fr

Un buffet déjeunatoire ainsi que des pauses gourmandes seront prévus lors de cette journée.
==
Conférenciers invités :
==
AURÉLIEN BELLET – DR INRIA LILLE, ÉQUIPE CRISTAL

Titre : Better Privacy Guarantees for Decentralized Federated Learning
Résumé : Les algorithmes entièrement décentralisés, dans lesquels les participants échangent des messages de pair à pair le long des bords d’un graphe de réseau, sont de plus en plus populaires dans l’apprentissage fédéré en raison de leur évolutivité et de leur efficacité. Intuitivement, les algorithmes décentralisés devraient également offrir de meilleures garanties de confidentialité, puisque les nœuds n’observent que les messages envoyés par leurs voisins dans le graphe. Mais formaliser et quantifier ce gain est un défi : les résultats existants se limitent à des garanties de confidentialité différentielle locale (LDP) qui négligent les avantages de la décentralisation. Dans cet exposé, je présenterai des relaxations appropriées de la confidentialité différentielle et montrerai comment elles peuvent être utilisées pour montrer des garanties de confidentialité plus fortes pour le SGD décentralisé, correspondant au compromis confidentialité-utilité du SGD centralisé dans certains contextes. Il est intéressant de noter que certains de ces algorithmes amplifient les garanties de confidentialité en fonction de la distance entre les nœuds du graphe, ce qui correspond bien aux attentes des utilisateurs en matière de confidentialité dans certains cas d’utilisation.
—
SONIA BENMOKHTAR – DR CNRS, LIRIS, LYON

Titre : Decentralized Learning (as an enabler) for Decentralized Online Services

Résumé : Il y a un fort élan vers les services basés sur les données à tous les niveaux de la société et de l’industrie. Cela a commencé par des applications Web à grande échelle telles que les moteurs de recherche Web (par exemple, Google, Bing), les réseaux sociaux (par exemple, Facebook, Twitter) et les systèmes de recommandation (par exemple, Amazon, Netflix) et devient de plus en plus omniprésent grâce à l’adoption de dispositifs portables et à l’avènement de l’Internet des objets. Tous ces services sont rendus possibles par la disponibilité de grandes infrastructures de calcul, de forts progrès en matière d’intelligence artificielle (IA) et en particulier d’apprentissage automatique, et la possibilité de collecter et d’agréger de grandes quantités de données sur les utilisateurs, leurs environnements et leurs organisations dans des infrastructures de cloud. Mais si les progrès de l’IA/ML et des infrastructures distribuées ont été considérables, les applications axées sur les données rendues possibles par ces avancées posent des problèmes importants en ce qui concerne le respect de la vie privée de leurs utilisateurs et peuvent engendrer des menaces telles que la censure, la perte de contrôle des données personnelles et les fuites de données. Plus récemment, des initiatives telles que le Web 3.0 promettent de décentraliser les services en ligne, au cœur desquels l’IA/ML joue un rôle crucial pour donner aux utilisateurs la possibilité de reprendre le contrôle de leurs données personnelles et empêcher une poignée d’acteurs économiques de trop concentrer le pouvoir de décision.
—
HAKIM HACID – PRINCIPAL RESEARCHER, TII, ABU DHABI, UAE (GROUPE AIDRC)

Title: Towards Edge AI: Principles, current state, and perspectives

Abstract: The artificial intelligence (AI) community has invested heavily in developing techniques able to digest very large amounts of data and extract value-added information and knowledge from them. Most of these techniques, in particular deep learning models, require substantial computing power and storage, which makes them well suited to cloud-based environments. The intelligence is therefore far from the end user, which raises concerns about, for example, data privacy and latency. Edge AI offers solutions to some of the problems inherent to the cloud and focuses on the best practices, architectures, and processes for extending data AI beyond the cloud. Edge AI brings AI closer to the end user and uses, for example, fewer communication resources, since processing is performed directly on the edge device. This talk will introduce Edge AI and give an overview of existing work and potential future directions for contribution.

We look forward to seeing many of you there!

Best regards,
The organizing committee

Our website: www.madics.fr
Follow us on Twitter: @GDR_MADICS

24 months post-doctoral position: Deep learning strategies to model complex systems

Offre en lien avec l’Action/le Réseau : – — –/– — –

Laboratory/Company: IMT Atlantique
Duration: 24 months
Contact: carlos.granero-belinchon@imt-atlantique.fr
Publication deadline: 2023-04-01

Context:
This project is multidisciplinary and focuses on the development of new Deep Learning models for non-linear multiscale description of complex systems. Applications in different topics such as fluid turbulence, remote sensing and ocean dynamics can be considered.

Subject:
The main objective of this postdoc position is the formulation of multiscale DL models able to extract non-linear couplings. Moreover, we want these models 1) to be based on the physics of the studied system, and thus to learn in a physics-guided way, and 2) to be interpretable from a physics point of view. To this end, both the loss function and the model architectures will be adapted. In addition, in order to emulate chaotic and complex systems, a stochastic component will be included and the uncertainties of the reconstructed states quantified.

Thus, this project aims to reconstruct the unknown states of the studied complex system from physical knowledge of the system and available data that can be spatially distant, prior in time, at coarser resolution etc. We can then envisage physics-informed super-resolution, data generation and forecasting (Fablet et al. 2021) among other applications.

For example, in the case of ocean modelling, the multiscale and non-linear nature of ocean surface dynamics plays a fundamental role in biogeochemical, ecological and climatic processes, and its characterization is consequently a major topic in current oceanographic research. Today, ocean dynamics can be studied through a large variety of remote sensing images of the ocean surface (Yahia et al. 2010, Renosh et al. 2015, Qiu et al. 2020) as well as from numerical simulations (Lellouche et al. 2021).

Candidate profile:
Candidates are required to have a PhD in Deep Learning/Machine Learning with strong experience in neural networks. The candidate must have spent at least 18 months in a non-French laboratory between May 1, 2019 and the start of the project.

Required education and skills:
Good skills in Python, PyTorch, and PyTorch Lightning are also required, as well as experience of teamwork. Previous experience in a multidisciplinary research team will also be considered a plus. Ideally (but not necessarily), the candidate will have previous experience in fluid physics and/or oceanography.

Work address:
The postdoc will work in collaboration with Carlos Granero-Belinchon and Ronan Fablet from IMT Atlantique, Simon van Gennip from Mercator Ocean International, and Bertrand Chapron from Ifremer. The research team is thus composed of physicists, oceanographers and artificial intelligence researchers from different laboratories, making this a multidisciplinary project. Moreover, the postdoc will work within the OSE research team at IMT (https://cia-oceanix.github.io/), which is a dynamic research group on image processing and artificial intelligence for oceanography and climate. The postdoc will also be part of the new Inria team Odyssey (https://team.inria.fr/odyssey/).

The post-doctoral position is a two-year full-time appointment starting during 2023. Gross salary will depend on the experience of the candidate, up to approx. 55,000 €/year (net salary: up to approx. 30,000 €/year). The candidate will also benefit from French social insurance, and will have up to 45 days of annual leave. The candidate will be able to benefit from up to 90 days of remote working per year.
The candidate will be based at the IMT Atlantique Campus (Brest) in a dynamic and stimulating working environment, a five-minute walk from the beach.
Within the framework of the ANR JCJC project SCALES, the postdoc will have funding for participation in conferences, publication fees and visits to external laboratories. Moreover, within the framework of the ANR Chair OCEANIX, the postdoc will have access to compute servers: Datarmor and servers from OSE at IMT Atlantique.
Teaching activities at IMT Atlantique will also be offered to the postdoc, mainly in signal processing, computer vision and artificial intelligence. These activities, which come with additional pay, will not be mandatory.
Motivated candidates should send a CV and a motivation letter to: carlos.granero-belinchon@imt-atlantique.fr.

The Postdoc is expected to start during 2023.

Attached document: 202212021346_Postdoc_ANR-SAD_v2.pdf

Data pipelines in the cloud: elastic execution with dynamic parallelism

Offre en lien avec l’Action/le Réseau : – — –/Innovation

Laboratory/Company: LIP6/Sorbonne Université and SAP France
Duration: 6 months
Contact: bernd.amann@lip6.fr
Publication deadline: 2023-04-01

Context:
Nowadays, institutions and companies manage their data with a wide variety of applications which were not designed to communicate with each other. On the other hand, there is a very strong need to design new data management and analysis services that will add value to the existing data. Since it is practically impossible to migrate all applications and their data into an integrated system, the current solution is to build analytic data pipelines to facilitate the data flow between operations that perform complex processing, including collecting data from multiple sources, transforming it, generating AI models through learning, and storing it in multiple destinations. In practice, a data pipeline can contain hundreds of operations, and it can evolve repeatedly by being populated with new operations or new data. Thus, with the increasing number of pipelines to be designed and deployed, it is crucial to have high-level data pipeline definition languages, tools to deploy and control the execution of data pipelines, and efficient solutions to optimize the execution of complex operations on large volumes of data.

In this context, SAP has developed the SAP Data Intelligence (DI) software for the automatic configuration and deployment of data pipelines. These pipelines use a flow-based programming model [3]. Each pipeline operation corresponds to a program (Python, Node.js, …) or a call to an external API (e.g., Spark job) that is deployed using an adapted Docker [2] image/container. Kubernetes services provide deployment and orchestration of these images on hyperscaler platforms like AWS, Google Cloud, Azure, etc.

A performance problem arises at large scale when a pipeline contains long operations processing massive data. A first solution was designed in the context of an SAP/LIP6 internship to parallelize operators [4]. In this solution, the way an operator consumes and produces data is described using data sorting and partitioning functions. This allows the data to be partitioned and distributed so that operators can process it in parallel. The principle of the method is to first define the properties of a “divide and conquer” mapping in the JSON configuration of an operator. These properties make it possible to automatically transform a DI pipeline into a new parallelized DI pipeline with several replicas (identical copies) of the initial operator, each running in parallel on a different part of the operator’s input data. A “dispatch” operator is injected into the data pipeline to split the input data stream into different partitions, and a “collect” operator is injected to aggregate the output of the replicas into a single output data stream. The first experiments show that this parallelization solution improves the performance of data pipelines, but does not achieve optimal performance in real environments, where the degree of operator replication/data parallelization needs to be estimated and dynamically adapted to the volumes of data exchanged, the calculations performed and the available resources.
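
To make the dispatch/replica/collect pattern described above concrete, here is a minimal Python sketch. It is not the SAP DI implementation: the function names (`dispatch`, `replica`, `collect`, `run_parallel`), the hash-based partitioning and the use of a thread pool are all assumptions chosen for illustration, and a stateless operator is assumed for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(records, n_replicas, key):
    """Split the input into n_replicas partitions by hashing a partition key."""
    partitions = [[] for _ in range(n_replicas)]
    for seq, record in enumerate(records):
        # Keep the sequence number so the collect step can restore the order.
        partitions[hash(key(record)) % n_replicas].append((seq, record))
    return partitions


def replica(operator, partition):
    """One replica: apply the operator to every record of its partition."""
    return [(seq, operator(record)) for seq, record in partition]


def collect(outputs):
    """Merge the replica outputs into a single stream, in input order."""
    merged = [item for output in outputs for item in output]
    return [result for _, result in sorted(merged, key=lambda item: item[0])]


def run_parallel(operator, records, n_replicas, key):
    """Hypothetical driver: dispatch, run the replicas in parallel, collect."""
    partitions = dispatch(records, n_replicas, key)
    with ThreadPoolExecutor(max_workers=n_replicas) as pool:
        outputs = list(pool.map(lambda p: replica(operator, p), partitions))
    return collect(outputs)


# Toy usage: square ten integers with three replicas.
result = run_parallel(lambda x: x * x, list(range(10)), n_replicas=3, key=lambda x: x)
print(result)
```

In this sketch the number of replicas is fixed before execution starts, which is precisely the limitation the internship aims to lift.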

Subject:
The objective of this internship is to propose new methods to facilitate and optimize the deployment and execution of parallelized data pipelines. This raises several scientific and technical challenges:

• Estimating the replication degree: How many replicas should be deployed for each operation to be processed in parallel? To answer this question, we need to estimate the benefit of parallel processing as a function of the number of replicas, the amount of data to be processed and the CPU consumption of an operation. This benefit must also be related to the cost of using the machines running data pipelines in the cloud, in order to determine an optimal number of replicas for a certain budget.

• Elastic deployment: How can we adapt the number of replicas to dynamic changes in available resources and associated costs? This calls for new solutions that allow the number of replicas (degree of parallelism) of an operator to be changed dynamically without interrupting the pipeline.
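
As an illustration of the first challenge, the sketch below picks a replica count under a budget. The cost model is purely an assumption made for this example (a fixed serial part, a perfectly divisible part, and a dispatch/collect overhead growing linearly with the number of replicas); the posting does not specify one, and all parameter names are hypothetical.

```python
def estimated_time(n, serial_s, parallel_s, overhead_s):
    """Assumed model: fixed serial part + divisible part + per-replica overhead."""
    return serial_s + parallel_s / n + overhead_s * n


def best_replica_count(serial_s, parallel_s, overhead_s,
                       price_per_replica_s, budget, max_replicas=64):
    """Return (n, time, cost) minimising the estimated time within the budget.

    Returns None when no replica count fits the budget.
    """
    best = None
    for n in range(1, max_replicas + 1):
        t = estimated_time(n, serial_s, parallel_s, overhead_s)
        # Every replica is billed for the whole duration of the run.
        cost = n * t * price_per_replica_s
        if cost <= budget and (best is None or t < best[1]):
            best = (n, t, cost)
    return best


# Toy usage with made-up numbers (seconds and arbitrary currency units).
print(best_replica_count(serial_s=10, parallel_s=600, overhead_s=2,
                         price_per_replica_s=0.001, budget=1.0))
```

With such a model, adding replicas stops paying off once the overhead term dominates, and the budget can cap the replica count below the time-optimal value.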

Internship goals and tasks

Internship #1. The goal of the first internship is to evaluate the performance of the parallelization method on different types of stateful operators by varying the CPU load of the operator, the size of the operator’s state, and the size of the messages dispatched to the replicas. The evaluation will be run on a Kubernetes cluster deployed on a hyperscaler platform. Through this evaluation, we expect to learn which configuration parameters provide the greatest parallelization benefit and to obtain suggestions for improving the parallelization method.

Tasks:

• Propose a model to estimate the overhead incurred by adding operations that partition data and distribute it to replicas in the pipeline.

• Design a method to observe the execution of the pipeline and detect an overload (underload) situation.

• Determine the new degree of parallelism that will improve pipeline performance.

Internship #2. The goal of the second internship is to implement dynamic dispatch and collect operators which automatically adapt to the scaling up or down of the number of replicas of a parallelized operator. For the dispatch operator, the strategy must guarantee that no message is lost in case of scaling down. For the collect operator, the strategy must guarantee that all messages produced by the replicas are properly collected and possibly re-ordered in case of scaling up.

Tasks:

• Design a technical solution to dynamically change the number of running operator replicas and adapt the dispatch and collect operators accordingly.

• Conduct experiments using data pipeline examples to check the validity of the implemented strategies and measure their possible overhead.
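
One possible building block for the collect side of Internship #2 is sketched below, under the assumption that the dispatch operator tags every message with a consecutive sequence number starting at 0. The class name `OrderedCollector` and its interface are hypothetical and not part of SAP DI. Because ordering relies only on sequence numbers, the buffer does not need to know how many replicas are currently running.

```python
import heapq


class OrderedCollector:
    """Buffer out-of-order results and release them in sequence order."""

    def __init__(self):
        self._next_seq = 0   # next sequence number to emit
        self._pending = []   # min-heap of (seq, payload)

    def receive(self, seq, payload):
        """Accept one result from any replica; return the messages now ready."""
        heapq.heappush(self._pending, (seq, payload))
        ready = []
        # Release the longest run of consecutive sequence numbers available.
        while self._pending and self._pending[0][0] == self._next_seq:
            ready.append(heapq.heappop(self._pending)[1])
            self._next_seq += 1
        return ready


# Toy usage: results arrive out of order from different replicas.
collector = OrderedCollector()
for seq, payload in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    print(seq, collector.receive(seq, payload))
```

This sketch does not cover the harder part of the task, namely guaranteeing that no message is lost while replicas are being added or removed.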

The solutions will be deployed in the SAP DI environment. Comparative experiments will be implemented on the Spark parallel computing platform. For this, a solution will be designed to transform the pipeline description (written with Data Intelligence syntax) into a Spark pipeline [1] (pyspark syntax).

References

[1] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. Spark SQL: Relational data processing in Spark. In ACM SIGMOD International Conference on Management of Data, pages 1383–1394, 2015.

[2] David Bernstein. Containers and cloud: From LXC to Docker to Kubernetes. IEEE Cloud Computing, 1(3):81–84, 2014.

[3] Tanmaya Mahapatra. High-level graphical programming for big data applications. Master’s thesis, Technische Universität München (TUM), 2019.

[4] Ludgy Vestris. Scaling up stateful and order preserving operators in DI data pipelines. Master’s thesis, CNAM, SAP – LIP6, 2022.

Candidate profile:
The candidate should have excellent experience in algorithms and programming (Python, Java) and advanced knowledge of optimization and parallelization techniques (query optimization, data parallelism, map-reduce, …). Some technical knowledge of Docker/Kubernetes is also helpful. To apply, send a CV and the grades of the last three semesters of study to the three co-supervisors (see email above).

Required education and skills:
Final year of a Master’s degree or of an engineering school

Work address:
• SAP France (Levallois-Perret)
• Database team of LIP6 (Paris): http://www-bd.lip6.fr/

Attached document: 202212021339_Stage_LIP6_SAP_2023-3.pdf
