Question Answering With Open Knowledge Bases

When:
02/09/2023 – 03/09/2023 all-day

Offer related to the Action/Network: DOING/– — –

Laboratory/Company: SAMOVAR – Télécom SudParis
Duration: 6 months
Contact: romerojulien34@gmail.com
Publication deadline: 2023-09-02

Context:
Given a text, it is possible to extract knowledge from it in the form of subject-predicate-object triples, where all components of the triples can be found in the text. This is called Open Information Extraction (OpenIE). For example, from the sentence “The fish swims happily in the ocean”, we can extract the triple (fish, swims, in the ocean). By gathering many of these statements, we obtain an Open Knowledge Base (OpenKB), with no constraints on the subjects, predicates, or objects.
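As a toy illustration of the kind of extraction involved, the sketch below uses spaCy's dependency parse to pull out simple subject-verb-object patterns. It is a crude stand-in for a real OpenIE system and only covers sentences shaped like the example above; the pattern rules are our own illustrative assumptions.

```python
# Minimal OpenIE-style triple extraction sketch (illustrative only, not a
# full OpenIE system). Assumes spaCy and its small English model are
# installed: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(sentence: str):
    """Extract (subject, predicate, object) triples with a crude
    dependency-pattern heuristic: nominal subject + verb + either a
    direct object or a prepositional phrase attached to the verb."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ == "nsubj"]
        objects = [c for c in token.children if c.dep_ == "dobj"]
        # Prepositional phrases: "swims ... in the ocean" -> "in the ocean"
        preps = [c for c in token.children if c.dep_ == "prep"]
        for subj in subjects:
            for obj in objects:
                triples.append((subj.text, token.text, obj.text))
            for prep in preps:
                pobj = " ".join(t.text for t in prep.subtree)
                triples.append((subj.text, token.text, pobj))
    return triples

print(extract_triples("The fish swims happily in the ocean"))
# Expected output (roughly): [('fish', 'swims', 'in the ocean')]
```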

This OpenKB can then be used for question answering (QA). Many approaches target QA over non-open KBs. They range from crafting query templates that, once filled in, are used to query the KB, to neural models that represent the question and the candidate answers as latent vectors, so that the correct answer lies close to the question in the embedding space (Bordes et al., 2014). In this project, we will focus on neural models, particularly knowledge graph embeddings, i.e., continuous representations of the entities and relations that capture relevant information about the graph’s structure.
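As a minimal sketch of the embedding-based scoring idea, the toy model below encodes the question and each candidate answer as the mean of their word embeddings and ranks candidates by dot product. All dimensions, vocabularies, and module names are illustrative assumptions; a real system would train the encoders with a ranking loss so that correct answers score above negatives.

```python
# Toy sketch of embedding-based KB question answering in the spirit of
# Bordes et al. (2014): embed the question and each candidate answer,
# then rank candidates by dot-product similarity.
import torch
import torch.nn as nn

class BagOfEmbeddings(nn.Module):
    """Encode a token-id sequence as the mean of its word embeddings."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.emb(token_ids).mean(dim=-2)  # (..., dim)

vocab_size, dim = 10_000, 64
question_encoder = BagOfEmbeddings(vocab_size, dim)
answer_encoder = BagOfEmbeddings(vocab_size, dim)

# One question (5 tokens) and 3 candidate answers (4 tokens each);
# real code would map words to ids with a tokenizer.
question = torch.randint(0, vocab_size, (5,))
candidates = torch.randint(0, vocab_size, (3, 4))

q = question_encoder(question)   # (dim,)
a = answer_encoder(candidates)   # (3, dim)
scores = a @ q                   # similarity of each candidate to the question
print("best candidate index:", scores.argmax().item())
```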

The current way KB embeddings are computed raises two main challenges:
* Each entity and relation must be seen enough times during training for the system to learn relevant embeddings. Training takes edge information into account, so each entity or relation must appear in a sufficiently large number of edges.
* The textual representation of the verbal and noun phrases of the relations, subjects, and objects should be considered.

For example, a recent approach, MHGRN, computes embeddings using a modified graph neural network architecture. This architecture, however, does not take the textual representation of relations into account.
A better approach is CaRe, which relies on two main ideas. First, it clusters the subjects and objects and creates an unlabelled edge between entities in the same cluster. This partially mitigates the problem of entities connected to few edges by leveraging connections to better-connected entities. Second, it computes embeddings for the relations using GloVe (word embeddings) and GRUs (recurrent neural networks). We believe the CaRe approach could be improved by considering more modern neural architectures based on message passing and by integrating the textual representation of predicates, objects, and subjects. In addition, we will investigate whether the clustering step is necessary, as it can introduce a bias in one important downstream application of KB embeddings: canonicalization, the task of finding a representative for a set of nodes or edges.
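As a rough sketch of the second idea, the module below runs a GRU over the word embeddings of a relation phrase and uses the final hidden state as the relation embedding. A randomly initialised embedding table stands in for pretrained GloVe vectors, and all hyperparameters and names are illustrative, not CaRe's exact configuration.

```python
# Sketch of a CaRe-style relation-phrase encoder: run a GRU over the
# word embeddings of the phrase and use the final hidden state as the
# relation embedding. A randomly initialised nn.Embedding stands in for
# pretrained GloVe vectors; in practice one would load GloVe weights
# with nn.Embedding.from_pretrained(glove_matrix).
import torch
import torch.nn as nn

class RelationPhraseEncoder(nn.Module):
    def __init__(self, vocab_size: int, word_dim: int, rel_dim: int):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, rel_dim, batch_first=True)

    def forward(self, phrase_ids: torch.Tensor) -> torch.Tensor:
        # phrase_ids: (batch, seq_len) token ids of the relation phrase
        embedded = self.word_emb(phrase_ids)  # (batch, seq_len, word_dim)
        _, h_n = self.gru(embedded)           # h_n: (1, batch, rel_dim)
        return h_n.squeeze(0)                 # (batch, rel_dim)

encoder = RelationPhraseEncoder(vocab_size=10_000, word_dim=50, rel_dim=128)
phrases = torch.randint(0, 10_000, (2, 3))  # 2 phrases of 3 token ids each
print(encoder(phrases).shape)  # torch.Size([2, 128])
```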

In this project, we will improve open KB embedding methods by:
* Exploring state-of-the-art neural architectures and language models (see the message-passing sketch after this list).
* Integrating textual representations of the subject, predicate, and object.
* Investigating whether clustering before embedding computation is necessary.
* Integrating the embeddings into question-answering models.
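To make the first two directions concrete, here is a minimal hand-rolled message-passing step over an open KB in PyTorch: each entity aggregates messages from its neighbours, and each message is conditioned on the text-derived embedding of the connecting relation (e.g. the output of the GRU encoder above). The shapes, names, and sum aggregation are illustrative assumptions, not the project's final design.

```python
# Minimal message-passing sketch over an open KB: each entity updates
# its embedding from its neighbours, with messages conditioned on the
# text-derived relation embedding of each edge. All shapes and names
# are illustrative assumptions.
import torch
import torch.nn as nn

class RelationAwareMessagePassing(nn.Module):
    def __init__(self, ent_dim: int, rel_dim: int):
        super().__init__()
        self.message = nn.Linear(ent_dim + rel_dim, ent_dim)
        self.update = nn.Linear(2 * ent_dim, ent_dim)

    def forward(self, ent: torch.Tensor, rel: torch.Tensor,
                edges: torch.Tensor) -> torch.Tensor:
        # ent: (num_entities, ent_dim) entity embeddings
        # rel: (num_edges, rel_dim) one text-derived embedding per edge
        # edges: (num_edges, 2) [source, target] entity indices
        src, dst = edges[:, 0], edges[:, 1]
        msg = torch.relu(self.message(torch.cat([ent[src], rel], dim=-1)))
        # Sum incoming messages per target entity (mean/attention also work)
        agg = torch.zeros_like(ent).index_add_(0, dst, msg)
        return torch.relu(self.update(torch.cat([ent, agg], dim=-1)))

num_entities, ent_dim, rel_dim = 5, 32, 128
ent = torch.randn(num_entities, ent_dim)
edges = torch.tensor([[0, 1], [1, 2], [3, 2], [4, 0]])
rel = torch.randn(edges.shape[0], rel_dim)  # e.g. GRU encoder outputs
layer = RelationAwareMessagePassing(ent_dim, rel_dim)
print(layer(ent, rel, edges).shape)  # torch.Size([5, 32])
```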

Topic:
Same as the context above.

Candidate profile:
The intern should be enrolled in a master’s program and have good knowledge of machine learning, deep learning, natural language processing, and graphs. A good command of Python and the standard data science libraries (scikit-learn, PyTorch, pandas, transformers) is also expected. In addition, previous experience with graph neural networks would be appreciated.

Required education and skills:
See the candidate profile above.

Job location:
Palaiseau

Attached document: 202302091340_internship_openie-1.pdf