Cross-modal reciprocal learning: location as supervision

31/03/2023 – 01/04/2023 all-day

Offer related to the Action/Network: – — –/– — –

Laboratory/Company: LASTIG
Duration: 36 months
Contact:
Publication deadline: 2023-03-31

Context:
Earth Observations (EO) nowadays refer to sensors with different modalities capturing rich and complementary information. EO data are typically large-scale, multimodal, and multi-resolution. They often display a complex structure in which the spatial, temporal, and spectral dimensions are entangled. This structure is influenced by phenomena operating at widely different scales, complex domain shifts (the distribution and appearance of certain classes vary significantly across regions and throughout the year), atmospheric conditions, and cumulative weather, resulting in inherently non-stationary processes. While these characteristics complicate the analysis of EO, the absolute spatial and temporal referencing of remote sensing acquisitions allows us to align them easily.

The deep learning paradigm is gradually being adopted as an indispensable tool for the automatic analysis of EO. However, state-of-the-art solutions suffer from multiple limitations: (1) existing deep models are not well suited to the structure of EO, (2) large-scale annotation is unrealistic, (3) the limited spatio-temporal extent and task-specificity of existing datasets and models result in poor generalization, and (4) most annotated datasets remain focused on specific areas and environments, biasing the community towards particular architectures.

We propose to address these limitations by developing a Foundation Model of Earth Observations, trained on extensive unannotated data. Such a model should be rich, expressive, compact, and robust to most domain shifts and geographical biases. We will explore the idea of reciprocal cross-modal training to leverage the multimodality and georeferential alignment of EO data.

Subject:
Foundation models refer to large neural networks trained on vast amounts of data and widely used for further applications or research. Such models can be built upon without having to be trained from scratch. This strategy saves computation time, provides expressive features and state-of-the-art encoders, and decreases the annotation and hardware requirements usually associated with large modern networks.
We propose a new learning paradigm dubbed “Cross-modal reciprocal learning” to train such foundation models from unannotated multi-modal EO.

Text-image contrastive pretraining has shown impressive results in computer vision, leading to expressive and influential foundation models. Contrastive learning has recently been explored for EO by exploiting spatial alignment across time series or for cross-modal localization. We propose generalizing text-image contrastive learning to the multi-modal setting with spatially aligned observations. We require the features extracted from acquisitions of an area through different modalities to be more similar to one another than to the descriptors of any other area. By enforcing spatial alignment across sensors, the features must describe the only shared latent variable: the actual semantics of the acquired area (e.g., road, building, plant species).
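The objective described above can be sketched as a symmetric InfoNCE loss between two modality encoders, in the style of text-image contrastive pretraining. The snippet below is a minimal NumPy illustration, not the project's actual implementation; the names (`cross_modal_infonce`, `z_a`, `z_b`) are ours, and the temperature value is an arbitrary placeholder.

```python
import numpy as np

def cross_modal_infonce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two sets of L2-normalized embeddings.

    z_a, z_b: (N, D) arrays; row i of each describes the same area i,
    acquired by two different sensors. The loss pulls matching rows
    (same area, different modality) together and pushes descriptors of
    different areas apart.
    """
    # cosine similarities between every pair of areas, scaled by temperature
    logits = (z_a @ z_b.T) / temperature            # (N, N)
    # positives sit on the diagonal: same area, different modality
    labels = np.arange(len(z_a))

    def ce(l):
        # numerically stable cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average both directions (modality a -> b and b -> a)
    return 0.5 * (ce(logits) + ce(logits.T))

def normalize(x):
    """Project embeddings onto the unit sphere (row-wise L2 norm)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)
```

When the two modalities agree on which areas match, the loss is low; shuffling one modality's rows (breaking the spatial alignment) drives it up, which is exactly the signal that location provides as free supervision.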

Generalizing contrastive learning to the multi-domain setting raises several theoretical and technical challenges. First, the classic two-modality formulation leads to a quadratic number of modality pairs w.r.t. the number of sensors, quickly becoming impractical and requiring us to develop an adapted loss and sampling procedure. Second, cross-modal learning implies the simultaneous training of several large networks and the manipulation of costly multi-modal batches. This raises technical issues such as inefficient memory use and prolonged training times. Lastly, if we only reward the encoders for spatial alignment across modalities, they may be discouraged from retaining sensor-specific information. This would result in weaker individual representations that discard the strengths of each sensor in favor of the lowest common denominator. This critical concern may be mitigated with multi-task learning and self-supervision.
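To give a concrete sense of the sampling issue, one simple baseline is to draw a single modality pair per training step rather than computing every pairwise loss at once; the helper below and its names are purely illustrative, not part of the proposed method.

```python
import random
from itertools import combinations

def pairwise_schedule(modalities, steps, seed=0):
    """Draw one modality pair per training step.

    With M sensors there are M * (M - 1) / 2 pairs, so evaluating every
    pairwise contrastive loss at each step quickly becomes impractical.
    Sampling a single pair uniformly at random keeps the per-step cost
    constant while covering all pairs in expectation.
    """
    rng = random.Random(seed)
    pairs = list(combinations(modalities, 2))
    return [rng.choice(pairs) for _ in range(steps)]
```

For instance, six modalities yield 15 possible pairs, each trained with the same two-modality loss; more elaborate sampling or a genuinely multi-way loss is precisely what the thesis would investigate.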

Depending on the application cases, we will focus on several modalities among the following six: Sentinel-2 optical time series, Sentinel-1 radar time series, high-resolution orthoimages, hyperspectral images, aerial LiDAR, and spaceborne LiDAR.

The PhD project also includes application cases related to the exploitation of the designed model. We will focus on standard downstream tasks such as multi-domain land-cover classification, crop mapping, and biogeophysical variable estimation (e.g., biomass estimation).

Candidate profile:
– Familiarity with computer vision, machine learning, and deep learning;
– Proficiency in Python and familiarity with PyTorch;
– Curiosity, rigor, motivation;
– (Optional) Familiarity with self-/weakly-supervised and contrastive learning;
– (Optional) Experience with aerial or satellite imagery and land-cover classification.

Required education and skills:
Master 2 (MSc) in computer science, applied mathematics, or remote sensing

Employment address:
IGN, 73 avenue de Paris, 94160 Saint-Mandé

Attached document: 202301111252_These2023_LASTIG_EO_reciprocal_learning.pdf