Robust-to-noise information extraction, unifying challenges of optical character recognition (OCR) and automatic speech recognition (ASR)

When:
04/04/2026 – 05/04/2026 all-day

Offer linked to the Action/Network: – — –/– — –

Laboratory/Company: La Rochelle Université – L3i Laboratory
Duration: 36 months
Contact: mickael.coustaty@univ-lr.fr
Publication deadline: 2026-04-04

Context:
The growing digitization of written and oral content has made Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) essential to cultural heritage preservation, media accessibility, legal documentation, knowledge management, and information retrieval. However, the outputs generated by these systems are inherently noisy: OCR is affected by document degradation, layout complexity, and poor scanning quality, while ASR suffers from background noise, overlapping speech, and non-standard oral expressions. Despite significant progress, recognition noise remains pervasive, and these imperfections directly degrade downstream natural language processing tasks, for which data quality is a key prerequisite. Although OCR and ASR exhibit many similar error phenomena, their correction has mostly been studied in isolation, resulting in a lack of unified methodologies.

Subject:
Objectives:
• Compare and analyse existing post-correction methods in OCR and ASR, and assess their potential for cross-domain adaptation.
• Develop unified approaches for post-correction that leverage the shared error patterns between OCR and ASR.
• Enable robust information extraction from noisy OCR and ASR outputs by designing strategies that mitigate the propagation of recognition errors into downstream NLP tasks.
Scientific challenges:
• Heterogeneity of noise sources: OCR errors stem from visual artifacts while ASR errors are acoustic in origin; a unified framework must generalize across both modalities.
• Domain adaptation: OCR/ASR models often struggle on domain-specific datasets (e.g., historical texts, administrative documents, technical reports, scientific papers…), requiring correction methods that adapt to varying contexts.
• Complex error structures: beyond character and subword substitutions, OCR/ASR introduce higher-level disruptions (mis-segmentation, overlapping text blocks or speech, layout misinterpretation) that complicate correction.
• Evaluation difficulties: classical metrics such as Character Error Rate (CER) or Word Error Rate (WER) fail to fully capture the impact of errors on downstream information extraction, which necessitates new evaluation methods.
• Scalability: correction methods must be applicable to large-scale corpora and adaptable to new
data without full retraining.
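The evaluation difficulty above can be made concrete with a minimal sketch (not part of the offer): a word-level Levenshtein distance yields WER, but two hypotheses with identical WER can have very different downstream impact. The example sentences are invented for illustration; real evaluations would rely on established toolkits.

```python
# Minimal sketch: WER via word-level Levenshtein distance, showing that
# equal WER can hide very different downstream extraction impact.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

reference = "the invoice was issued by acme corp in 2019"
# Both hypotheses have WER = 1/9, but only the first breaks entity extraction:
hyp_entity = "the invoice was issued by acne corp in 2019"    # corrupts the entity
hyp_function = "an invoice was issued by acme corp in 2019"   # harmless substitution
```

Identical WER scores here mask the fact that only one error destroys the named entity a downstream extractor would need, which is precisely why task-oriented metrics are called for.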
To tackle these challenges, the thesis will explore a combination of:
• Comparative state-of-the-art analysis: systematic benchmarking of existing OCR and ASR
post-correction methods on heterogeneous corpora.
• Unified modeling approaches: leveraging neural architectures (e.g., sequence-to-sequence
models, transformers, multilingual pre-trained LLMs) that can learn correction patterns across
both modalities.
• Hybrid methods: integrating symbolic rules, edit distance algorithms, and domain-specific
lexicons with machine learning models to improve robustness.
• Error modeling and simulation: designing artificial noise injection techniques to train models on
synthetic but realistic OCR/ASR-like errors, thus improving generalization.
• Evaluation frameworks: extending standard CER/WER with task-oriented metrics reflecting the
quality of downstream information extraction and retrieval.
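The error-modeling axis above can be sketched as follows; this is an illustrative toy, with a hypothetical hand-written confusion table, whereas a real system would estimate character confusions from aligned OCR output and ground truth.

```python
import random

# Hypothetical table of visually confusable glyphs (assumption, not from the
# offer); in practice confusions would be learned from aligned corpora.
OCR_CONFUSIONS = {"l": "1", "1": "l", "o": "0", "0": "o", "m": "rn", "e": "c"}

def inject_ocr_noise(text, error_rate=0.1, seed=0):
    """Replace characters with visually confusable ones at a given rate,
    producing (noisy, clean) pairs for training a post-correction model."""
    rng = random.Random(seed)  # seeded for reproducible synthetic corpora
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < error_rate:
            out.append(OCR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)

clean = "the claim was filed on 10 march"
noisy = inject_ocr_noise(clean, error_rate=0.3)
# (noisy, clean) can now serve as a synthetic training pair.
```

An analogous generator for ASR would operate at the phonetic rather than the visual level (e.g., confusing homophones), which is exactly the kind of shared abstraction a unified framework would need.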
This thesis aims to overcome the current limitations of automatic correction of texts produced by OCR
and ASR systems by proposing a unified approach, which would represent a significant scientific advance.
An in-depth analysis of the similarities and differences between OCR and ASR errors will provide a
better understanding of how these two fields can intersect. This project will enable the development of
more robust methods based on multidisciplinary knowledge from natural language processing, signal
processing, and image processing. The expected results will thus offer new perspectives in the
development and use of multimodal language models, contributing to the evolution of generative AI in
both language processing and signal processing. With the rise of multimodal databases (text, image,
audio, video), this thesis could inspire the creation of tools capable of simultaneously exploiting data
from various sources to extract more relevant information. The thesis is expected to contribute to
bridging the OCR and ASR research communities and to open new research avenues
in multimodal NLP.

Candidate profile:
The highly motivated candidate should hold a master's degree in computer science or a related field, with
a strong background in NLP and an interest in text processing and multimodal data (text, speech,
document images). Familiarity with generative AI methods (e.g., large language models, text-to-text
generation, deep learning, fine-tuning strategies) will be a strong asset.

Required education and skills:
Master of Science in computer science, AI, or applied mathematics, or any equivalent diploma

Employment address:
mickael.coustaty@univ-lr.fr

Attached document: 202603191652_Alloc_Doc_AI_DH_Coustaty_Suire_Public.pdf