Mixed data temporal clustering for modelling longitudinal surveys

20/04/2021 – 21/04/2021 all-day

Offer related to the Action/Network: none specified

Laboratory/Company: ERIC, Université de Lyon
Duration: 3 years
Contact: julien.jacques@univ-lyon2.fr
Publication deadline: 2021-04-20

Context:
In many areas of the humanities and social sciences, studies are based on questionnaires completed by participants, often several times over the study period. Researchers then analyse these questionnaires to identify typical behaviours within the studied population. The statistical analysis of these questionnaires is far from simple, however, for several reasons. First, the answers are often of different types: nominal categorical (for example “what is your socio-professional category?”), ordinal categorical (for example “what is your level of satisfaction: bad, average, good?”), quantitative (“what is your age?”) and textual (open questions with free answers). The analysis of such mixed data is a current research problem in statistics and machine learning, and for lack of an existing solution practitioners often transform the data to make them homogeneous. Such an approach is unsatisfactory, since it either introduces bias or discards a substantial amount of information.
The second scientific obstacle is modelling the temporal evolution of the answers. Currently, analyses are carried out independently at each temporal phase; researchers then try, a posteriori, to link these separate analyses by looking for similar typical behaviours from one phase to the next. Ideally, a single model of the temporal evolution would describe all the responses to the questionnaires at once. The analysis would then reveal typical patterns of temporal evolution, which are precisely what researchers in the humanities and social sciences wish to study.
This thesis will thus provide a complete tool for analysing questionnaires repeated over time. Its core will be the development of a statistical model and the associated inference algorithms. The PhD student will also implement a software tool in the form of an R package, so that researchers in the humanities and social sciences can easily use these results.

Subject:
In many areas of the humanities and social sciences, studies are based on questionnaires completed by participants. The data provided by the participants’ answers are of different natures, generally referred to as mixed data in the literature:
– nominal categorical: when questions are of the type “what is your socio-professional category?”,
– ordinal categorical: when questions are of the type “what is your level of satisfaction: bad, average, good?”,
– quantitative: when questions are of the type “what is your age?”,
– textual: for open questions with free answers.
Moreover, we consider questionnaires repeated over time: participants fill in the questionnaires several times during the study period. A time component is thus added to the mixed data set.
The machine learning task we want to address for these data is unsupervised: there is no specific quantity that we want to predict; rather, we want to explore the data set in order to reveal typical behaviours. This task is known as clustering: we want to build clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. The data analysis will thus no longer rely on the individual responses to the questionnaires, but on the summaries provided by the clusters.
More specifically, once the whole data set of observations over time has been clustered, each cluster will gather a set of participants whose answers evolve in the same way over time. This information is essential for data analysis from a humanities and social sciences point of view.
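To make the clustering task concrete, here is a minimal Python sketch of a simple distance-based baseline. It is not the model-based approach that the thesis aims to develop. It assumes a Gower-style dissimilarity (mismatch for nominal answers, scaled rank difference for ordinal answers, range-scaled difference for quantitative answers), averages it over survey waves, and groups participants with a naive k-medoids procedure. The toy data, the function names and the choice of k-medoids are all illustrative assumptions; textual answers are left out, and every participant is assumed to have answered every wave.

```python
# Hypothetical toy data (not from any real survey): each participant answers
# the same three questions at two waves: (category, satisfaction, age).
SATISFACTION = {"bad": 0, "average": 1, "good": 2}  # ordinal levels, in order

participants = {
    "p1": [("worker", "bad", 30), ("worker", "average", 31)],
    "p2": [("worker", "bad", 32), ("worker", "average", 33)],
    "p3": [("manager", "good", 50), ("manager", "good", 51)],
    "p4": [("manager", "good", 48), ("manager", "average", 49)],
}


def answer_dissimilarity(a, b, age_range):
    """Gower-style dissimilarity in [0, 1] between two sets of answers."""
    nominal = 0.0 if a[0] == b[0] else 1.0
    ordinal = abs(SATISFACTION[a[1]] - SATISFACTION[b[1]]) / (len(SATISFACTION) - 1)
    quantitative = abs(a[2] - b[2]) / age_range
    return (nominal + ordinal + quantitative) / 3


def trajectory_dissimilarity(x, y, age_range):
    """Average the per-wave dissimilarities of two participants."""
    return sum(answer_dissimilarity(a, b, age_range) for a, b in zip(x, y)) / len(x)


def k_medoids(ids, dist, k, max_iter=100):
    """Naive k-medoids on a precomputed dissimilarity table."""
    # Farthest-first initialisation keeps the sketch deterministic.
    medoids = [ids[0]]
    while len(medoids) < k:
        candidates = [i for i in ids if i not in medoids]
        medoids.append(max(candidates, key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        # Assignment step: attach each participant to the nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in ids:
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: the new medoid minimises the total dissimilarity
        # to the other members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


ids = list(participants)
ages = [age for waves in participants.values() for (_, _, age) in waves]
age_range = max(ages) - min(ages)  # scales age differences to [0, 1]

dist = {
    i: {j: trajectory_dissimilarity(participants[i], participants[j], age_range) for j in ids}
    for i in ids
}
clusters = k_medoids(ids, dist, k=2)
groups = sorted(sorted(members) for members in clusters.values())
print(groups)
```

On this toy data the two groups should be p1 with p2 and p3 with p4, that is, participants whose answers follow the same trajectory across the two waves. A baseline of this kind weights all variables and waves equally and ignores the order of the waves, which is part of what a dedicated temporal model would be expected to improve on.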

Candidate profile:
Master’s or engineering degree in applied mathematics or computer science.

Apply at

Required training and skills:
Machine learning, R or Python, statistical learning

Work address:
Université Lyon 2, Campus Porte des Alpes, Bron

Attached document: 202104010502_IA PhD – JACQUES – PRIM-ALLAZ.pdf