Injecting Knowledge into Multimedia Entity Representation

When:
03/10/2021 – 04/10/2021 all-day

Offer related to the Action/Network: DOING/– — –

Laboratory/Company: CEA List and LISN
Duration: 18 months
Contact: herve.le-borgne@cea.fr
Publication deadline: 2021-10-03

Context:
Exploiting multimedia content often relies on the correct identification of entities in text and images. A major difficulty in understanding multimedia content lies in its ambiguity with regard to the actual user needs, for instance when identifying an entity from a given textual mention or matching a visual object to a query expressed through language [ADJ20a,ADJ20b].

The MEERQAT (https://www.meerqat.fr) project addresses the problem of analyzing ambiguous visual and textual content by learning and combining their representations and by taking into account the existing knowledge about entities. It aims at solving the Multimedia Question Answering (MQA) task, which requires answering a textual question associated with a visual input such as an image, given a knowledge base (KB) containing millions of unique entities and associated text. The post-doc specifically addresses the problem of injecting knowledge into multimodal entity representations so that questions relating to these entities can be answered. Other partners of the project work on visual, textual and KB representations, and on entity disambiguation.

Subject:
Entities can be represented by different modalities, in particular by visual and textual content. In a common space, an entity can thus be represented by several vectors that need to be combined into a unique representation reflecting the similarity of related entities. In such a context, a promising approach consists of learning a visual representation from natural language supervision [RAD21], relying on large datasets and a simple learning strategy based on contrastive predictive coding [OOR18], adapted to text and visual modalities [ZHA20]. The learned representation makes it possible to address multiple cross-modal tasks and provides a large-scale vocabulary suited to a general audience in a given language. It exhibits state-of-the-art performance on several tasks and can even exceed human performance on some of them. However, it does not include any structural information from a knowledge base. The main task of the post-doc will thus consist in injecting such prior knowledge into the entity representation to address Multimedia Question Answering. Some approaches were recently proposed to do so in the context of caption generation [GOE20].
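As an illustration of the contrastive strategy mentioned above, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss between image and text embeddings. It is not the project's method: the function names, the temperature value of 0.07 and the toy random embeddings are assumptions chosen for the example, and it does not include any knowledge-base information.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Contrastive loss over a batch where row i of each matrix is a matching pair.

    The temperature is an assumed hyperparameter, not a value from the posting.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarity matrix
    diag = np.arange(logits.shape[0])            # positives lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data: "aligned" images sit next to their text, "shuffled" ones do not.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = text + 0.01 * rng.normal(size=(4, 8))
shuffled = aligned[[1, 2, 3, 0]]

loss_aligned = symmetric_info_nce(aligned, text)
loss_shuffled = symmetric_info_nce(shuffled, text)
print(loss_aligned, loss_shuffled)
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are mismatched, which is what pulls the two modalities into a common space.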

We consider entities such as persons, places, objects or organizations (NGOs, companies, etc.). Depending on the type of an entity, the information to take into account in its representation is not obvious. While a person can probably be associated with a couple of mentions and images, the choice is less clear for other types of entities. For instance, a company can be associated with its logo, but also with its main products or even its main managers (CEO, CTO, etc.). In the same vein, a location can be represented by many pictures, and a large one such as a city by some emblematic buildings or places. The second task of the post-doc will consist in determining the appropriate information to include in the representation of a given entity, depending on its type.

Candidate profile:
* PhD in Natural Language Processing, Computer Vision, Machine Learning or other relevant fields
* Strong publication record, with accepted articles in top-tier conferences and journals of the domain
* Solid programming skills (PyTorch/TensorFlow). Publicly available projects will be appreciated
* Ability to communicate and collaborate at the highest technical level
* Experience using GPUs on a supercomputer (e.g. with SLURM or a similar tool) will be appreciated

Required education and skills:
PhD in Natural Language Processing, Computer Vision, Machine Learning or other relevant fields

Work location:
The post-doc will be supervised by CEA and LISN. The candidate will be hired by CEA (Palaiseau, near Paris, France) for an 18-month post-doc. LISN is located close to CEA on the Paris-Saclay University campus.

The salary depends on qualifications and experience. It includes social coverage (health, unemployment, retirement).

The post-doc will have access to large supercomputers equipped with multiple GPUs and large storage for experiments, in addition to a professional laptop.

To apply for the position, send a CV (including a publication list or a URL pointing to it, such as Google Scholar) and a cover letter to Hervé Le Borgne, Olivier Ferret, Sahar Ghannay and Anne Vilnat.