Link prediction in distributed knowledge graphs

When:
31/03/2022 – 01/04/2022 all-day

Laboratory/Company: LORIA UMR7503 CNRS-Université de Lorraine
Duration: 6 months
Contact: sabeur.aridhi@loria.fr
Publication deadline: 2022-03-31

Context:
Today, vast and diverse sources of data exist for almost every scientific domain, which makes their integration and intelligent exploitation challenging. Complex data require expressive representation models such as graphs. The Linked Open Data (LOD) movement and the FAIR (Findability, Accessibility, Interoperability, Reusability) data principles are intended to facilitate the integration and analysis of heterogeneous data. In the LOD context, graphs are called knowledge graphs because they encompass domain ontologies that type objects and describe their relationships. Semantic web languages (RDFS, OWL, SPARQL) have reached a level of maturity on which ambitious machine learning techniques can rely. In addition, big data and NoSQL solutions make web-scale data analyses possible. So far, such analyses on dedicated big-data architectures have often been limited to MapReduce scenarios on rather simple data models (key-value oriented, or homogeneous graphs with a single type of node and a single type of edge). Graph databases, as one NoSQL approach, allow a rich representation of multi-typed, attributed nodes and edges. This greater expressivity comes at a cost, as distributing both the graph and the programs that process it is not an easy task.

The objective of this Master project is to advance the state of the art of link prediction in knowledge graphs in a distributed setting [1][2][3]. We will mainly focus on link prediction approaches proposed by the CAPSID team to solve biological problems such as drug discovery.
The proposed distributed approaches will be evaluated on web-scale knowledge graphs by inferring missing links (data completion). YAGO, DBpedia, and synthetic benchmarks can be used for such evaluation and validation purposes [4].
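
To make the task concrete, here is a minimal Python sketch of link prediction used as data completion. It is not one of the CAPSID approaches mentioned above: it assumes a simple shared-neighbor heuristic as a stand-in baseline, and the toy drug-target triples and the function names `rank_objects` and `reciprocal_rank` are invented for illustration. One known link is hidden, candidates are ranked, and the reciprocal rank of the hidden link is reported.

```python
from collections import defaultdict

# Toy drug-target knowledge graph as (subject, predicate, object) triples.
# All entity names are invented for illustration.
TRIPLES = [
    ("drugA", "targets", "protein1"),
    ("drugA", "targets", "protein2"),
    ("drugB", "targets", "protein1"),
    ("drugB", "targets", "protein2"),
    ("drugB", "targets", "protein3"),
    ("drugC", "targets", "protein3"),
    ("drugD", "targets", "protein1"),
    ("drugD", "targets", "protein2"),
]


def rank_objects(triples, subject, predicate):
    """Rank objects not yet linked to `subject` by a shared-neighbor count.

    A candidate object gains one point for every (other subject, shared object)
    pair that connects it to `subject` through `predicate`.
    """
    objs = defaultdict(set)   # subject -> objects
    subjs = defaultdict(set)  # object -> subjects
    for s, p, o in triples:
        if p == predicate:
            objs[s].add(o)
            subjs[o].add(s)

    known = set(objs.get(subject, ()))
    scores = defaultdict(int)
    for shared in known:
        for other in subjs[shared]:
            if other == subject:
                continue
            for candidate in objs[other] - known:
                scores[candidate] += 1
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def reciprocal_rank(triples, held_out):
    """Hide one known triple and return 1/rank at which it is recovered."""
    subject, predicate, obj = held_out
    train = [t for t in triples if t != held_out]
    ranked = rank_objects(train, subject, predicate)
    for rank, (candidate, _score) in enumerate(ranked, start=1):
        if candidate == obj:
            return 1.0 / rank
    return 0.0  # the hidden link was never proposed


if __name__ == "__main__":
    print(rank_objects(TRIPLES, "drugA", "targets"))
    print(reciprocal_rank(TRIPLES, ("drugA", "targets", "protein2")))
```

Averaging the reciprocal rank over many hidden links gives the mean reciprocal rank, a common summary for this kind of evaluation. On graphs the size of YAGO or DBpedia, the in-memory indexes used here would no longer fit on one machine, which is where the distributed setting comes in.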

Subject:
This Master thesis project aims to develop scalable link prediction methods in large and complex graphs. More specifically, the aims of this project are:

– to design scalable implementations of the studied approaches for distributed architectures. In this context, the use of big graph processing frameworks such as Pregel, Trinity, GraphLab and BLADYG needs to be studied [5];
– to define evaluation and validation protocols for the proposed algorithms in the context of web-scale knowledge graphs.
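
As an illustration of the vertex-centric style popularized by Pregel, the following Python sketch computes common-neighbor counts, a classic link prediction feature, through message passing in three supersteps. It is a single-process simulation under stated assumptions: `run_supersteps` and `common_neighbor_compute` are hypothetical names and do not reproduce the API of Pregel, Trinity, GraphLab or BLADYG, and the small undirected graph is invented.

```python
from collections import Counter, defaultdict

# Small undirected graph as an adjacency map (invented for illustration).
GRAPH = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}


def run_supersteps(adjacency, compute, num_supersteps):
    """Single-process stand-in for a vertex-centric engine.

    In each superstep every vertex consumes its inbox and emits
    (target, message) pairs, delivered at the start of the next superstep.
    """
    state = {v: None for v in adjacency}
    inbox = {v: [] for v in adjacency}
    for step in range(num_supersteps):
        outbox = defaultdict(list)
        for v in adjacency:
            state[v], outgoing = compute(step, v, adjacency[v], state[v], inbox[v])
            for target, message in outgoing:
                outbox[target].append(message)
        inbox = {v: outbox.get(v, []) for v in adjacency}
    return state


def common_neighbor_compute(step, vertex, neighbors, state, messages):
    """Vertex program: count common neighbors with each non-adjacent vertex."""
    if step == 0:
        # Announce this vertex's ID to all of its neighbors.
        return state, [(n, vertex) for n in neighbors]
    if step == 1:
        # Forward every received ID to the other neighbors.
        out = [(n, origin) for origin in messages for n in neighbors if n != origin]
        return state, out
    # Final step: each copy of an ID arrived over a distinct two-hop path,
    # so its multiplicity equals the number of common neighbors.
    counts = Counter(m for m in messages if m != vertex and m not in neighbors)
    return dict(counts), []


if __name__ == "__main__":
    result = run_supersteps(GRAPH, common_neighbor_compute, 3)
    for vertex in sorted(result):
        print(vertex, result[vertex])
```

In a real distributed engine the vertices would be partitioned across workers and the outbox would become network traffic, so the number of messages forwarded in the second superstep (quadratic in vertex degree) is the kind of cost a scalable design has to control.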

This project will be carried out mainly within the CAPSID team at INRIA Nancy, which combines expertise in knowledge graphs and distributed graph computing (https://capsid.loria.fr).

Candidate profile:
Candidates must have a bachelor's degree in computer science, mathematics, or one of the physical sciences.

Required education and skills:
Good programming skills in an object-oriented programming language such as Java or C++ are essential. Experience with NoSQL solutions (Neo4j, Titan, MongoDB), parallel/distributed programming (Spark, Hadoop, Flink) and graph processing frameworks (Pregel, GraphLab, GraphX) is desirable but not essential.

Work address:
Laboratoire Lorrain de Recherche en Informatique et ses Applications
LORIA
Campus Scientifique
BP239
54500 Vandoeuvre les Nancy