Towards a Generic Storage and an Adaptive Query Optimization for Astronomical Data Management at Scale

When:
30/04/2019 – 01/05/2019 all-day

Announcement related to the Action/Network: MAESTRO

Laboratory/Company: Laboratoire DAVID – Université de Versailles Saint-Quentin
Duration: 3 years
Contact: Karine.Zeitouni@uvsq.fr
Publication deadline: 30-04-2019

Context:
The amount of data produced daily has exploded over the last decade. In spite of the extraordinary advances in its supporting technologies, Big Data processing remains a very active topic of database research.
In the framework of a previous PhD project co-funded by UVSQ and CNES, the UVSQ/ADAM group proposed and implemented a distributed framework, ASTROIDE (https://cnesuvsqastroide.github.io), for the efficient querying of astronomical surveys.
Drawing on this experience, this thesis proposal aims at extending the ASTROIDE framework towards more genericity, more functionality, and better adaptivity of the system.
More precisely, one objective of this PhD thesis is to propose an optimized generic storage and exchange model as an alternative to traditional astronomical data formats, and to implement it on modern distributed processing architectures. The second objective is to improve the overall system performance throughout its execution. To this end, we envision applying machine learning techniques to past execution traces in order to optimize the current query execution.

Topic:
Efficient query processing of astronomical data requires optimizing the data storage. Today, the formats most used in astronomy are FITS, HDF5, gbin, or plain csv/gzip, mainly for data exchange purposes. Their disadvantage with respect to large surveys lies either in their complexity or in their lack of compactness. More importantly, they all incur a significant overhead to be loaded and used in NoSQL systems [16], such as Spark, or in well-known packages (e.g., pandas). Nowadays, the Parquet format [12], recommended by the Apache consortium, is becoming a de facto standard adopted by a large variety of Big Data tools and NoSQL systems. Indeed, it is a compact columnar storage format that is self-describing and supports compression, data partitioning, and row-group indexing. However, there is in fact a gap between the astronomical standards and Parquet. One of the objectives of this PhD thesis is to fill this gap by proposing an optimized generic storage and exchange model for astronomical data in a distributed processing environment. The use of Parquet, or equivalent formats (e.g., Kudu), will favour its adoption, since various systems use them as native storage.
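
As a rough illustration of this storage target (not the project's prescribed design), a minimal PySpark sketch of converting a catalogue delivered as csv/gzip into compressed, partitioned Parquet is given below. The input path, the column names (ra, dec), and the declination-zone partitioning are illustrative assumptions, not the actual ASTROIDE layout.

    # Minimal sketch (PySpark): csv/gzip catalogue -> compressed, partitioned Parquet.
    # Paths, column names, and the partitioning scheme are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("catalog-to-parquet").getOrCreate()

    # Read the raw survey dump; Spark infers a flat relational schema.
    catalog = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("survey_catalog.csv.gz"))      # hypothetical input path

    # Derive a coarse sky-zone key from the declination, used as partition column.
    catalog = catalog.withColumn("dec_zone", F.floor((F.col("dec") + 90.0) / 10.0))

    # Write columnar, compressed Parquet, physically partitioned by sky zone so that
    # queries restricted to a zone only touch the matching directories/row groups.
    (catalog.write
     .mode("overwrite")
     .option("compression", "snappy")
     .partitionBy("dec_zone")
     .parquet("survey_catalog.parquet"))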
To this end, several issues must be addressed: How to partition these data across processing nodes? How to index them? How to deal with updates? How to measure the performance in terms of ingestion, access & filtering, updates, etc.? What is the impact of the parameters, and how should they be tuned?
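
To make the "access & filtering" measurement concrete, the following sketch times two equivalent range queries over the Parquet layout written above: one relying only on Parquet row-group statistics, the other also naming the partition column so that directory-level pruning applies. Paths, column names, and the dec_zone arithmetic are assumptions carried over from the previous sketch.

    # Minimal sketch: timing access & filtering on the partitioned Parquet layout.
    import time
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("parquet-benchmark").getOrCreate()
    catalog = spark.read.parquet("survey_catalog.parquet")  # path from the previous sketch

    def timed_count(df, label):
        """Force execution with count() and report the elapsed wall-clock time."""
        start = time.perf_counter()
        n = df.count()
        print(f"{label}: {n} rows in {time.perf_counter() - start:.2f}s")

    # Filter expressed only on data columns: min/max row-group statistics are pushed
    # down to the Parquet reader, but every partition directory is still listed.
    timed_count(catalog.filter((F.col("dec") > 30) & (F.col("dec") < 31)),
                "stats pushdown")

    # Same query with the partition column made explicit: directory-level pruning
    # skips all other dec_zone partitions before any file is opened.
    timed_count(catalog.filter((F.col("dec_zone") == 12) &
                               (F.col("dec") > 30) & (F.col("dec") < 31)),
                "partition pruning")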

A well-adapted data storage structure should reduce the overall system load. This adaptation is guided by the response time and throughput for a query workload, as well as by variations in the available resources. A possible solution is to monitor the activity of the system and learn its performance behavior from previous execution traces, which allows optimizing the current or subsequent executions. This can be done by keeping track of the system dynamics in terms of execution performance and resource consumption. A significant decrease in performance may automatically trigger either a data reorganization, such as re-partitioning, local indexing, or caching, or an adaptation of the resource allocation (number of executors, memory, and number of CPUs per executor).
The questions are: What parameters should be collected, and what are the performance metrics? How to establish a cost model and a distributed caching technique? Which access path should be chosen when different indexing and/or storage methods are available? How to adapt to the data and workload profiles? How to optimize complex queries, and mixed query-and-update workloads?
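
As a rough illustration of this learning-based direction (not the project's prescribed method), the sketch below fits a regression model on synthetic execution traces and flags a query whose observed latency is far above the learned prediction, which could then trigger re-partitioning, re-indexing, or a change of resource allocation. The feature set, the numbers, and the 2x threshold are all assumptions.

    # Minimal sketch: learn expected query latency from past execution traces and
    # flag queries that run much slower than predicted. All values are synthetic.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # One row per past query: (rows scanned, partitions touched, executors,
    # memory per executor in GB) -> observed latency in seconds.
    X = np.array([
        [1e6, 4, 2, 4], [5e6, 12, 4, 8], [2e7, 48, 8, 8],
        [1e6, 2, 4, 8], [8e6, 20, 4, 4], [3e7, 64, 16, 16],
    ])
    y = np.array([3.1, 9.8, 41.0, 1.9, 15.5, 52.0])

    model = GradientBoostingRegressor().fit(X, y)

    def needs_reorganization(features, observed_latency, factor=2.0):
        """Return True if the query ran much slower than the model predicts."""
        predicted = model.predict(np.array([features]))[0]
        return observed_latency > factor * predicted

    # A new query over 5e6 rows / 12 partitions / 4 executors / 8 GB took 35 s:
    # well above the learned expectation, so a reorganization would be triggered.
    print(needs_reorganization([5e6, 12, 4, 8], observed_latency=35.0))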

Candidate profile:
The applicant should hold a Master's degree in Computer Science, or equivalent. She/he should have:
– Strong object-oriented and system programming skills, and a solid background in systems and databases
– Good background in data mining / machine learning
– Good English oral communication, technical reading, and writing skills
– Proficiency in French is desirable but not mandatory

Required education and skills:
Master's degree in Computer Science, or equivalent.

Please submit your application including:
– cover letter
– CV
– copies of the relevant certificates
– academic transcripts of the last two years
– list of references
– any complementary documents: recommendation letters, relevant publications if any.

Applications should be submitted on the website of the doctoral school; see:
https://www.universite-paris-saclay.fr/en/education/doctorate/sciences-et-technologies-de-linformation-et-de-la-communication-stic-0#the-doctorate

and in parallel by email to:
Prof. Karine Zeitouni
DAVID Lab – University of Versailles Saint-Quentin – Paris Saclay University
www.david.uvsq.fr/zeitouni
www.universite-paris-saclay.fr
E-mail: Karine.Zeitouni@uvsq.fr

Work address:
DAVID Lab. – Université de Versailles – Paris Saclay University
45 Avenue des Etats-Unis
78035 Versailles Cedex
http://perso.prism.uvsq.fr/users/zeitouni/sujetThese_ED580_22969.pdf

Attached document: sujetThese_ED580_22969.pdf