On Capturing and Using Provenance in Machine Learning Pipelines

When:

24/03/2022 – 25/03/2022 all-day

Offer related to the Action/Network: – — –/PhD students

Laboratory/Company: LAMSADE
Duration: 5 to 6 months
Contact: kbelhajj@googlemail.com
Publication deadline: 2022-03-24

Context:
Machine learning pipelines are designed to generate predictive models from raw data. The learned models are then used to make predictions on new (unseen) observations. The predictive power of a learned model depends largely on the data sets used for training and on how they have been preprocessed (engineered). ML-pipeline developers tend to rely mainly on their skills, past experience, and an iterative trial-and-error process to refine and improve ML pipelines.

Subject:
We seek to investigate how provenance information can be used to improve the process by which ML pipelines are designed and refined. In particular, the sub-tasks of the internship are as follows:
*T1*. A survey of the state of the art of provenance in data preprocessing and machine learning.
*T2*. Identifying techniques for collecting and using provenance with a view to assisting ML developers in designing, improving, and debugging ML pipelines.
*T3*. The implementation of a prototype and its validation in the context of a real-world ML pipeline.

Candidate profile:
The candidate must be a Master's student or an engineering student in his/her final year of study. To apply, send your CV, a letter of motivation, and transcripts for the last three years to kbelhajj@gmail.com and daniela.grigori@lamsade.dauphine.fr

Required education and skills:
Familiarity with data processing and with unsupervised and supervised machine learning algorithms

Work address:
Université Paris Dauphine, Place du Maréchal De Lattre de Tassigny, 75016, Paris

Attached document: 202202240950_Internship-MLPipelinesProvenance.pdf

MaDICS

Masses de Données, Informations et Connaissances en Sciences

Big Data - Data Science
