Analysis of the health content of a corpus of tweets using the signature method

When:
01/03/2020 – 02/03/2020 all-day

Announcement related to an Action/Network: none

Laboratory/Company: IECL, Nancy
Duration: Six months
Contact: marianne.clausel@univ-lorraine.fr
Publication deadline: 2020-03-01

Context:
In partnership with the Laboratoire d’Informatique de Grenoble, we have collected tweets over three years. Our goal is to understand the factors involved in certain ailments as well as the links between these ailments. In preliminary work [3], we developed two probabilistic models, TM-ATAM and T-ATAM, which extend Latent Dirichlet Allocation. They summarize the health content of a corpus of tweets while taking time into account.

Subject:
The output of the method is a vector-valued time series, which we analyzed using statistical tools. Notably, we detected change points in the health content of our corpus, which provides a relevant way to detect transitions in the environmental context (e.g., seasons). We aim to combine this model with recent tools from rough paths theory [1,2] to give new insights into the two models TM-ATAM and T-ATAM.
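To make the idea of a change point in a vector-valued time series concrete, here is a minimal sketch in plain Python. It is not the detector used in [3], which this announcement does not describe. It assumes a single change point and a piecewise-constant mean, and it picks the split that minimises the within-segment squared error. The function names and the toy data are hypothetical.

```python
def segment_cost(rows):
    """Sum of squared deviations of each row from the segment mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(d))


def best_single_change_point(series, min_size=2):
    """Return the index k such that splitting the series into series[:k]
    and series[k:] gives the smallest total within-segment cost."""
    n = len(series)
    best_k, best_cost = None, float("inf")
    for k in range(min_size, n - min_size + 1):
        cost = segment_cost(series[:k]) + segment_cost(series[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Toy data (assumption): 30 time steps dominated by one topic,
# followed by 30 time steps dominated by another.
toy = [[0.7, 0.3]] * 30 + [[0.2, 0.8]] * 30
print(best_single_change_point(toy))  # prints 30
```

Real data would call for a method that handles several change points and noise; the exhaustive single split above is only meant to show what is being estimated.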

In particular, we aim to identify causality relations between ailments and to use the skew-symmetric nature of the order-2 signature to cluster the data. The internship will be divided into two parts: first understanding TM-ATAM/T-ATAM and the signature method, then applying them to our real data.
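As an illustration of the order-2 signature mentioned above, the following sketch computes the first two signature levels of a piecewise-linear path with Chen's identity, then extracts the skew-symmetric part (the Lévy area). It is a minimal plain-Python sketch under the assumption that the time series is interpolated linearly between observations; the function names are hypothetical and it is not the project's implementation.

```python
def signature_level2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    path is a list of points in R^d. Level 1 is the total increment.
    Level 2 is the matrix of iterated integrals, accumulated segment by
    segment with Chen's identity.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        for i in range(d):
            for j in range(d):
                # cross term with the path so far, plus the segment's own term
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2


def levy_area(s2):
    """Skew-symmetric part of the level-2 signature."""
    d = len(s2)
    return [[0.5 * (s2[i][j] - s2[j][i]) for j in range(d)] for i in range(d)]


# Coordinate 0 moves first, then coordinate 1.
s1, s2 = signature_level2([(0, 0), (1, 0), (1, 1)])
print(levy_area(s2)[0][1])  # prints 0.5

# Same endpoints, but coordinate 1 moves first: the sign flips.
_, s2_rev = signature_level2([(0, 0), (0, 1), (1, 1)])
print(levy_area(s2_rev)[0][1])  # prints -0.5
```

Both toy paths have the same total increment, so level 1 cannot tell them apart. The sign of the Lévy area records which coordinate moved first, which is why the skew-symmetric part is a natural candidate feature for lead-lag questions and for clustering.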

Contacts: Massih-Reza Amini (Massih-Reza.Amini@imag.fr), Antoine Lejay (antoine.lejay@inria.f), Marianne Clausel (marianne.clausel@univ-lorraine.fr).

Candidate profile:
Master 2 in statistical learning

Required education and skills:
Strong programming skills in Python, knowledge of statistical learning

Work address:
Institut Élie Cartan de Lorraine
Université de Lorraine, Site de Nancy
B.P. 70239, F-54506 Vandoeuvre-lès-Nancy Cedex

Attached document: