Data Integration and Querying through Scalable Neural Data Representations for Data Lakes

When:
31/03/2024 – 01/04/2024 all-day

Laboratory/Company: LIP6, Sorbonne Université
Duration: 6 months
Contact: rafael.angarita@lip6.fr
Publication deadline: 2024-03-31

Context:
Data lakes are collections of massive heterogeneous datasets hosted in a variety of storage systems. In contrast to data warehouses, where the data has been transformed to answer specific queries, data lakes store raw, unformatted data ranging from structured data such as relational tables, through semi-structured data such as JSON documents, to unstructured data such as textual documents with no predefined schema or data model. Integrating such heterogeneous data is a crucial step towards providing a unified and coherent view of the information within a data lake; however, traditional integration approaches still struggle with disparate data and fail to uncover the hidden relations within it.

Neural data representations for databases are a novel approach for revealing hidden, latent information within the data using deep learning. Applications of queries over neural representations of data include fact-checking, table metadata generation, and content prediction in relational tabular data, as well as the discovery of missing links in knowledge graphs. However, neural data representation approaches cannot yet be applied to data lakes, since they lack the expressiveness to perform complex queries and do not handle large volumes of data efficiently.

Subject:
In this project, we aim to investigate and develop new methods for integrating and querying heterogeneous data within data lakes using deep learning models. This raises the following technical challenges: how to encode the semantics of heterogeneous datasets into the embedding learning process, and how to reconcile datasets that have different schemas and contain incomplete and noisy data.

Internship goals and tasks:
• Literature review: Conduct a comprehensive literature review to understand existing methods and frameworks, starting with three categories: Neural Tabular Data Representations, Knowledge Graph Embeddings, and Scaling Up Neural Representations of Databases.
• Data collection: Collect a diverse range of heterogeneous data sources, including structured data (e.g., tables) and unstructured data. For structured data, several datasets exist, such as WikiTables-TURL, WDC Web Table Corpus, and VizNet. These datasets are used for tasks such as question answering, semantic parsing, table retrieval, table metadata prediction, and table content population.
• Scalable querying of neural data lakes: Execute queries that require combining results from these diverse neural data representations. The aim is to deliver more complete answers than can be obtained by querying each model in isolation.
• Comparative evaluation: Design experiments and benchmarks to evaluate the effectiveness of the proposed approach in generating embeddings for querying data lakes. Note that existing benchmarks are specific to certain downstream tasks, such as question answering and fact-checking for tabular data, and link prediction for knowledge graphs; the challenge of this task therefore lies in designing a benchmark that tests the intrinsic capabilities of neural representations of data lakes.

Candidate profile:
Computer Science

Required education and skills:
The candidate should have excellent experience in algorithms and Python programming, and advanced knowledge of machine learning and of relational and non-relational databases.

Work address:
LIP6, Sorbonne Université, 4 Place Jussieu, 75005 Paris.

Attached document: 202312041116_Stage_LIP6_2024.pdf