Topic Extraction and Alignment for Large Scientific Document Collections

When:

15/12/2017 – 16/12/2017 all-day


Announcement linked to an Action/Network: none

Laboratory/Company: LIP6 – UPMC
Duration: 36 months
Contact: bernd.amann@lip6.fr
Publication deadline: 2017-12-15

Context:
This thesis is funded by the EPIQUE ANR project (http://www-bd.lip6.fr/wiki/site/recherche/projets/epique) and takes place in the Database research team (http://www-bd.lip6.fr) of the LIP6 laboratory (http://www.lip6.fr) in Paris. The goal is to develop new tools for exploring large scientific document collections (Web of Science, Medline, …) and building interactive topic evolution maps (or “phylomemies” [1]) that represent the evolution of science. These tools are based on efficient algorithms and data structures implemented on top of recent big-data infrastructures such as Apache Spark.

Subject:
A topic evolution map represents the evolution of science as a set of topics over a sequence of time periods, where topics from different periods can be aligned through specific evolution links. For example, data-related research topics have evolved rapidly over the last 25 years: new research topics have appeared (NoSQL, Big Data, MapReduce data processing, Data Science, Deep Learning), often by replacing, splitting or combining earlier research topics (semi-structured data, parallel DBMS, machine learning, neural networks). Building such maps is a complex task involving a variety of data processing steps (see the EPIQUE project description for more details).

This thesis mainly deals with two steps of the EPIQUE workflow:

Topic extraction step: The first step is to extract semantic topic structures from large, complex real-world document collections in different application domains (science, social web, news). A large spectrum of topic extraction models and algorithms already exists, based on graph clustering, matrix factorization (LDA) and other techniques. Existing topic models and algorithms do not scale to such collections, and a first challenge will be to define and adapt scalable data mining solutions based on new data structures and recent parallel data processing frameworks [2,3,4,5].

Topic alignment step: The second step consists in exploring the evolution of science by aligning semantic topic structures from different time periods. This alignment is based on a topic evolution model representing different semantic evolution steps (birth, split, join, death, …) for topics from different time periods [1]. The goal is to propose a formal topic evolution model based on existing work on scientific evolution and to implement efficient algorithms for the temporal alignment of the semantic topic structures generated by step 1.
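To give applicants a concrete picture of the alignment step, here is a minimal Python sketch. It is an illustration only and does not reproduce the EPIQUE model or the method of [1]. It assumes that a topic is a plain set of terms, that two topics from consecutive periods are linked when their Jaccard similarity reaches a threshold, and that evolution steps are read off the number of links. The function names, the threshold value, and the toy topics and terms are all hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_periods(prev_topics, next_topics, threshold=0.2):
    """Link topics of two consecutive periods and label evolution steps.

    prev_topics / next_topics: dict mapping a topic id to its set of terms.
    The threshold is an arbitrary illustrative value, not a recommended one.
    Returns (links, prev_events, next_events).
    """
    # Keep every cross-period pair whose similarity reaches the threshold.
    links = []
    for p, p_terms in prev_topics.items():
        for n, n_terms in next_topics.items():
            sim = jaccard(p_terms, n_terms)
            if sim >= threshold:
                links.append((p, n, sim))

    successors = defaultdict(list)
    predecessors = defaultdict(list)
    for p, n, _ in links:
        successors[p].append(n)
        predecessors[n].append(p)

    def label(count, none, many):
        # No link: the topic appears or disappears; several links: split or join.
        return none if count == 0 else many if count > 1 else "continue"

    prev_events = {p: label(len(successors[p]), "death", "split") for p in prev_topics}
    next_events = {n: label(len(predecessors[n]), "birth", "join") for n in next_topics}
    return links, prev_events, next_events


# Toy data (invented for illustration): two earlier topics merge into one,
# one earlier topic disappears, and one new topic appears.
prev_period = {
    "p1": {"xml", "schema", "document", "query"},
    "p2": {"parallel", "partition", "cluster", "query"},
    "p3": {"expert", "rule", "inference"},
}
next_period = {
    "n1": {"document", "schema", "parallel", "partition", "cluster", "json"},
    "n2": {"layer", "gradient", "gpu"},
}

links, prev_events, next_events = align_periods(prev_period, next_period)
print(prev_events)
print(next_events)
```

The pairwise comparison above is quadratic in the number of topics per period, which is precisely the kind of cost that a scalable implementation on a parallel framework would need to reduce.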

EXPECTED RESULTS

The first outcome of this thesis will be innovative tools for reconstructing and exploring multi-scale dynamics in complete real-world scientific corpora and for obtaining new insights into the evolution of complex human-generated knowledge and information. The second outcome will be new large-scale data processing solutions for implementing advanced text and graph mining algorithms. Our goal is in particular to provide generic low-level solutions that can be customized, independently of the higher-level mining algorithms, with respect to specific cost models and hardware constraints (memory, CPU, network).

START DATE : February 1st 2018

Candidate profile:
Applicants should have strong analytical and programming skills (Java, Scala, Python), a high capacity to understand new concepts and to work independently, and good expertise in database-related topics (distributed databases, query optimisation, big data platforms).

Applicants will have to send an email and attach:
* an application letter in English or French
* their CV
* their university/grade transcripts of the last two years
* a copy of their last diploma
* recommendation letters (optional)
to bernd.amann@lip6.fr and hubert.naacke@lip6.fr.

Required education and skills:
Applicants must hold a Master’s degree in Computer Science (or have an equivalent academic background) and have excellent written and oral communication skills in English (French is a plus).

Place of employment:
The LIP6 Laboratory of Computer Sciences, Paris (http://www.lip6.fr), with a staff of 470 people including 170 permanent researchers, 250 PhD students, postdocs, engineers and administrative employees, is today one of the most important centers of Computer Science in France. LIP6 is part of the Université Pierre et Marie Curie and, as a joint research unit of CNRS (UMR 7606), is also linked to the INS2I (Institut des sciences de l’information et de leurs interactions). The LIP6 laboratory is composed of 20 research teams structured into 7 departments, which cover a wide spectrum of computer science domains: scientific computing, decision making, optimization problems in artificial intelligence and operational research, databases and machine learning, networks and systems, systems on chips, complex systems.


MaDICS

Masses de Données, Informations et Connaissances en Sciences

Big Data - Data Science
