TOWARDS AN OPTIMIZED AND GENERIC STORAGE MODEL FOR ASTRONOMICAL DATA IN SPARK

When:

13/02/2019 – 14/02/2019 all-day


Announcement related to the Action/Network: MAESTRO

Laboratory/Company: DAVID – Université de Versailles
Duration: 5 to 6 months
Contact : Karine.Zeitouni@uvsq.fr
Publication deadline: 2019-02-13

Context:
Applications in the sciences of the universe are among the most demanding users of Big Data technology. Recent sky and Earth survey programs will produce petabytes of data. Exploratory analysis of these data is crucial for scientists and practitioners to better understand their data and optimize various processes. This requires efficient database systems to manage and query this unprecedented amount of data.
Efficient query processing of astronomical data calls for an optimized data representation. Today, the most widely used formats in astronomy are FITS, HDF5, and plain CSV, mainly for data exchange. Meanwhile, the Parquet format, backed by the Apache Software Foundation, is becoming a de facto standard adopted by a wide variety of Big Data tools and NoSQL systems. However, there is a gap between the standard astronomical formats and Parquet. More importantly, given the volume of astronomical data, this gap adds significant overhead to the loading process in NoSQL systems like Spark, since the data must first be converted from FITS to the format used by the target system.

Subject:
The main objective of this internship is to fill this gap by proposing an optimized, generic storage model in Spark that represents at least the FITS and HDF5 data formats as Spark DataFrames. One focus of the proposed solution is to take advantage of the FITS/HDF5 data organization to optimize existing astronomical operators. The proposed design should be scalable, support incremental upload of large datasets, and optimize the related performance. The internship will proceed as follows:
– First, the trainee will become familiar with the team’s work on ASTROIDE (a distributed data server for big astronomical data, https://cnesuvsqastroide.github.io) and with the NoSQL technologies required by the project.
– Next, she/he will propose a baseline solution, not necessarily optimal for querying, but better suited to loading FITS and HDF5 data into DataFrames.
– Finally, she/he will further optimize both ingestion and query performance and compare them to the baseline.
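
To give candidates a concrete idea of what "taking advantage of the FITS data organization" could mean, here is a minimal sketch in plain Python. It is not the ASTROIDE design and not a proposed solution. It relies only on the FITS convention that a header is a sequence of 80-character ASCII cards stored in 2880-byte blocks and terminated by an END card, and that a table extension describes its columns with TFIELDS, TTYPEn and TFORMn keywords. The functions `read_header`, `table_schema` and `make_header` are hypothetical names introduced for illustration, and the demo header is simplified (a real table header carries more mandatory keywords).

```python
import io

CARD = 80     # a FITS header card is 80 ASCII characters
BLOCK = 2880  # FITS headers are padded to multiples of 2880 bytes


def read_header(stream):
    """Read one FITS header and return its keyword/value pairs as strings."""
    cards = {}
    while True:
        block = stream.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("truncated FITS header")
        for i in range(0, BLOCK, CARD):
            card = block[i:i + CARD].decode("ascii")
            key = card[:8].strip()
            if key == "END":
                return cards
            if card[8:10] == "= ":
                # Simplification: drop any "/ comment" and the quotes.
                # String values that contain "/" are not handled here.
                value = card[10:].split("/", 1)[0].strip().strip("'").strip()
                cards[key] = value


def table_schema(header):
    """List (column name, FITS format code) pairs for a table extension."""
    n = int(header.get("TFIELDS", "0"))
    return [
        (header.get(f"TTYPE{i}", f"col{i}"), header[f"TFORM{i}"])
        for i in range(1, n + 1)
    ]


def make_header(pairs):
    """Hypothetical helper: build a simplified in-memory header for the demo."""
    cards = [f"{k:<8}= {v}".ljust(CARD) for k, v in pairs]
    cards.append("END".ljust(CARD))
    raw = "".join(cards).encode("ascii")
    return raw + b" " * (-len(raw) % BLOCK)


demo = make_header([
    ("XTENSION", "'BINTABLE'"),
    ("TFIELDS", "2"),
    ("TTYPE1", "'RA'"),
    ("TFORM1", "'D'"),
    ("TTYPE2", "'DEC'"),
    ("TFORM2", "'D'"),
])
schema = table_schema(read_header(io.BytesIO(demo)))
print(schema)  # [('RA', 'D'), ('DEC', 'D')]
```

The point of the sketch is that the column names and types are available from the header alone, without scanning the data. A loader could, as one possible design, map such format codes to DataFrame column types (for example `D`, a 64-bit float, to a double column) before reading any rows.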

Candidate profile:
We seek highly motivated and ambitious candidates with a strong interest in big data technology and strong object-oriented programming skills. The candidate should be familiar with the Unix scripting environment and tools like git, maven, ... This internship may lead to a PhD thesis in collaboration between the DAVID Lab at UVSQ/Paris-Saclay University and CNES (Centre National d’Etudes Spatiales). A good background in data mining / machine learning is a plus for the PhD thesis.

Required education and skills:
Open to Master 2 level students, or equivalent, in computer science, in the field of data engineering or data science.

Work address:
DAVID Laboratory (located in Versailles, France), University of Versailles Saint-Quentin / University of Paris-Saclay.
45 Avenue des Etats-Unis – 78000 Versailles, France.
web: www.david.uvsq.fr

Attached document: Master_Internship_Versailles.pdf

MaDICS

Masses de Données, Informations et Connaissances en Sciences

Big Data - Data Science
