Challenges of Mixed Data Clustering

Calendar

When:

29/02/2024 – 01/03/2024 all-day


Internships

Offer related to the Action/Network: SimpleText/– — –

Laboratory/Company: DVRC
Duration: 4 months
Contact: sonia.djebali@devinci.fr
Publication deadline: 2024-02-29

Context:
Industrial context

The energy sector is undergoing a significant transformation, driven by the need to increase the share of renewable energy sources and to improve energy efficiency, and is moving toward the Smart Grid. This technology allows for the analysis, management, and coordination of energy production, consumption, and distribution, with the goal of promoting more sustainable practices. A challenge arises from the fact that the data is mixed, containing both numerical and categorical information, and often arrives as a data stream. Traditional methods designed for purely numerical data are not well suited to such data, so analyzing it requires adapted methods.
Trusted Third Parties for Energy Measurement and Performance need advanced tools for analyzing complex systems, able to handle rich and heterogeneous data, in order to provide clients with independent energy performance analyses and recommendations. These tools must also be easy for energy experts to interpret, to facilitate classification and recommendation.
Creating clusters of similar buildings is an effective way to handle complex energy data. Hierarchical clustering of mixed data is a key approach because it allows energy experts to easily associate clusters with recommendations. Beyond the energy sector, it has applications in fields such as biology, medicine, marketing, and economics.

Subject:
Scientific context

Although mixed data is widespread, few clustering tools are specifically designed for it. Some of the bottlenecks have already been identified in a previous scientific paper. Here is a non-exhaustive list of bottlenecks one can encounter when handling mixed data in a pipeline:

Data preprocessing: Preprocessing is a critical step in mixed data clustering; it includes handling missing data, encoding categorical data, and scaling numerical data.
Feature selection: Mixed data clustering requires feature selection to be performed before clustering. However, selecting relevant features can be a challenging and time-consuming task.
Metric selection: Choosing the right distance metric to measure the similarity between observations described by different data types is difficult.
Evaluation: There is a lack of standard evaluation criteria for mixed data clustering, which makes it hard to compare different methods.
Computational complexity: Mixed data clustering involves dealing with different types of data and distance metrics, which can result in high computational complexity.
Visualization: It is difficult to create visualizations that effectively communicate the relationships between different data types.
Interpretation: Understanding the relationships between different data types can be challenging, especially if the clusters are not well separated or if the data has been transformed before any method is applied.
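
To make the preprocessing and metric-selection bottlenecks concrete, here is a minimal sketch of one possible approach, not the method used by the lab. It assumes a Gower-style dissimilarity (range-scaled absolute difference for numerical features, simple mismatch for categorical ones, features with a missing value skipped) followed by a naive average-linkage agglomerative clustering. The toy building records, column layout, and function names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy records: (surface_m2, yearly_kwh, heating_type, usage).
# None marks a missing value.
buildings = [
    (120.0, 15000.0, "gas", "office"),
    (130.0, 16000.0, "gas", "office"),
    (900.0, 98000.0, "electric", "school"),
    (950.0, None, "electric", "school"),
]
NUMERIC = (0, 1)       # column indices treated as numerical
CATEGORICAL = (2, 3)   # column indices treated as categorical


def column_ranges(rows, numeric_cols):
    """Range (max - min) of each numerical column, ignoring missing values."""
    ranges = {}
    for c in numeric_cols:
        values = [r[c] for r in rows if r[c] is not None]
        ranges[c] = (max(values) - min(values)) or 1.0  # avoid division by zero
    return ranges


def gower(a, b, ranges):
    """Gower-style dissimilarity in [0, 1]; features missing in a or b are skipped."""
    total, used = 0.0, 0
    for c in NUMERIC:
        if a[c] is None or b[c] is None:
            continue
        total += abs(a[c] - b[c]) / ranges[c]  # range scaling keeps each term in [0, 1]
        used += 1
    for c in CATEGORICAL:
        if a[c] is None or b[c] is None:
            continue
        total += 0.0 if a[c] == b[c] else 1.0  # simple matching on categories
        used += 1
    return total / used if used else 1.0


def average_linkage(rows, n_clusters):
    """Naive agglomerative clustering: merge the two closest clusters until n_clusters remain."""
    ranges = column_ranges(rows, NUMERIC)
    dist = {(i, j): gower(rows[i], rows[j], ranges)
            for i, j in combinations(range(len(rows)), 2)}
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        # Average pairwise dissimilarity between the members of two clusters.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: x < y, so index x is unaffected
    return [sorted(c) for c in clusters]


clusters = average_linkage(buildings, n_clusters=2)
print(clusters)
```

On this toy data the two small gas-heated offices end up in one cluster and the two large electrically heated schools in the other, even though one record lacks its consumption value. The pairwise loop is quadratic in the number of records, which illustrates the computational-complexity bottleneck; on real data an optimized routine such as `scipy.cluster.hierarchy.linkage` applied to a precomputed distance matrix would likely be preferable.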

Candidate profile:
M1 or M2 (Master's level) student in computer science (Master's program or engineering school).

Required education and skills:
Knowledge of Machine Learning, clustering, and Python, and experience using ML libraries
Strong interest in academic research
Ability to carry out literature reviews
Rigor, ability to synthesize, autonomy, and ability to work in a team

Work address:
Pole Léonard de Vinci
92 916 Paris La Défense Cedex

Attached document: 202312221037_2024_Stage_MixedData.pdf

MaDICS

Masses de Données, Informations et Connaissances en Sciences

Big Data - Data Science
