Schema Profiling of Massive Nested Key-Value Data and its Application to Effective Machine Learning

When:

01/06/2020 – 02/06/2020 all-day

Offer related to the Action/Network: – – –/PhD students

Laboratory/Company: LAMSADE, Université Paris Dauphine
Duration: 36 months
Contact: dario.colazzo@dauphine.fr
Publication deadline: 01-06-2020

Context:
Nested key-value data formats such as JSON are very popular because they overcome the rigidity of relational databases by adopting flexible, schema-less models. This flexibility is desirable, especially when data is produced by uncontrolled sources, but the resulting variable structure also complicates the processing and analysis of the data. Major NoSQL systems such as MongoDB [11], Couchbase [3], Apache Drill [1] and Spark [4] already adopt some schema extraction mechanism to reveal the structure of the data when it is loaded. However, the extracted schemas are purely structural and cannot express richer semantic constraints such as correlations or dependencies. At the same time, several machine learning frameworks [10, 5] support nested data formats for submitting training data.

In the literature, there have been some attempts at profiling relational data, as witnessed by a recent survey [6]. For JSON, data profiling is in its infancy, and the few existing approaches [8, 9] require flattening the data before applying standard classification or clustering techniques devised for relational data. Moreover, scalability is not addressed, although JSON datasets are expected to be large and running classification or clustering algorithms on them may be prohibitively expensive. Recently, Couchbase introduced a schema extraction module that classifies JSON documents based on their structure [3], using a kind of decision tree as in [9]. However, the semantics of its classification approach are not clearly understood, since no formal documentation is available.
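
To make the flattening step mentioned above concrete, here is a minimal sketch in Python. It is an illustration only, not the method of [8] or [9]: the function name `flatten`, the dotted-path convention and the sample document are all assumptions made for this example.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into one level keyed by dotted paths.

    Hypothetical helper: array positions become path components ("tags.0").
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = ((str(i), v) for i, v in enumerate(value))
    else:
        # Scalar leaf: emit a single (path, value) pair.
        return {prefix: value}
    flat: Dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


# Made-up sample document, for illustration only.
doc = {"user": {"name": "ann", "tags": ["a", "b"]}, "age": 30}
flat_doc = flatten(doc)
```

Note that each document may yield a different set of columns (and empty records or arrays vanish entirely), which is one reason flattening loses structural information that a schema could retain.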

Subject:
The first goal of this PhD project is to devise and study techniques for extracting constraints in a distributed fashion over large JSON datasets. A possible direction is to build on the distributed schema inference approach developed in [7], which extracts statistical information about the structure of JSON datasets, and to extend it in several directions, including: counting enumeration, constraints and statistics on simple values contained in records and arrays, tuple types, and set operators like difference.
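
As a rough illustration of what extracting statistical information about structure in a distributed fashion can look like, the sketch below summarizes each document independently (a map step) and combines the summaries with an associative, commutative merge (a reduce step). This is a deliberately simplified stand-in, not the approach of [7]: the names `type_name`, `profile` and `merge`, the path notation and the sample data are assumptions made for this example.

```python
import json
from collections import Counter
from functools import reduce


def type_name(v):
    """Return a coarse type label for a parsed JSON value."""
    if v is None:
        return "null"
    if isinstance(v, bool):  # must be checked before int: bool is a subclass of int
        return "bool"
    if isinstance(v, (int, float)):
        return "num"
    if isinstance(v, str):
        return "str"
    if isinstance(v, list):
        return "array"
    return "record"


def profile(value, path="$"):
    """Map step: count (path, type) occurrences within one document."""
    counts = Counter({(path, type_name(value)): 1})
    if isinstance(value, dict):
        for key, child in value.items():
            counts += profile(child, f"{path}.{key}")
    elif isinstance(value, list):
        for child in value:
            counts += profile(child, f"{path}[]")
    return counts


def merge(a, b):
    """Reduce step: associative and commutative, so partitions can merge in any order."""
    return a + b


# Made-up sample data: one JSON document per line.
lines = ['{"id": 1, "tags": ["x", "y"]}', '{"id": "2", "tags": []}', '{"id": 3}']
schema = reduce(merge, (profile(json.loads(line)) for line in lines), Counter())
```

Here `schema` records, for instance, that `$.id` is numeric in two documents and a string in one. Because `merge` is associative and commutative, the same two functions could in principle be used with a map/reduce pair in a framework such as Apache Spark.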

The second goal is to study ways of exploiting informative schemas to optimize the data preparation phase of machine learning pipelines. This phase is acknowledged to be a major challenge, since extracting relevant features and transforming them into a form suitable for the target algorithm requires a good understanding of the underlying data. Without such an understanding, it is impossible to write complete extraction programs that account for all the issues that can arise in the data, such as incompatibilities in the type or structure of the data.

The third goal is to use informative schemas for data exploration purposes. The idea is to guide users in formulating queries that express meaningful feature extraction programs, and also to inject some of the constraints expressed in the schema into the inference process itself.

Candidate profile:
The project lies at the intersection of three major domains: data management, machine learning and type theory. Good proficiency in one of these domains is sufficient, but in general the candidate is expected to have good modeling and programming skills. The language of choice is usually one of Java, Scala or Python. A good knowledge of database internals, of systems in the Hadoop ecosystem and of the TensorFlow framework is desirable. The expected outcome of the thesis consists of both formal material and system development. Our goal is to apply the solutions to the problems described above in mainstream frameworks for shared-nothing parallelism and distribution such as Apache Spark [4] or Apache Flink [2], but also in more specific systems such as MongoDB [11] and Couchbase [3], when applicable. This entails carrying out a study of recent approaches for optimizing JSON representation and storage in such frameworks.

Required education and skills:
Master's degree in Computer Science or an equivalent degree.

Place of employment:
LAMSADE, Université Paris Dauphine
Contacts:
dario.colazzo@dauphine.fr
mohamed-amine.baazizi@lip6.fr

Attached document: 202004271358_Thesis-PSL-2020.pdf

MaDICS

Masses de Données, Informations et Connaissances en Sciences

Big Data - Data Science