Newsletter #24 – February 2023
Dear reader,

This month’s edition starts off with an Open Call by the European Commission for €20 million for proposals on Natural Language Understanding and Interaction in Advanced Language Technologies.

Furthermore, we’re introducing OSCAR, an open source project aiming to provide web-based multilingual resources and datasets for Machine Learning and Artificial Intelligence applications. OSCAR is particularly popular for training large language models. Our tools and resources section features ModelFront, an API that predicts the quality of machine translations.

After last month’s announcement of the SRIA Open Call results, we’re taking a closer look at two of the selected projects: NGT-Dutch Hotel Review Corpus by Tilburg University and Building E2E spoken-language understanding systems for virtual assistants in low-resources scenarios by Balidea.

The “From the SRIA” section presents some of the Policy Recommendations included in the Strategic Agenda and Roadmap.

With best regards,

Georg Rehm
The European Commission recently published a €20 million call for proposals on ‘Natural Language Understanding and Interaction in Advanced Language Technologies’ under the HORIZON EUROPE research program. The call will be open until 29 March 2023, followed by an evaluation period expected from April to May.

The call covers the following topics:

Improving context-aware human-machine interaction by increasing the understanding and exploitation of the interaction context and content in multimodal settings, in order to make interactive AI solutions such as smart assistants, conversational and dialogue systems and content generation models more responsive.

Supporting and enhancing seamless human-to-human communication across languages, e.g. by means of automatic translation or interpretation (incl. automatic subtitling) in real time, with a greater understanding of the communication context and the meaning involved.
Language Technology and NLP in the news
Social media highlights
General News
The OSCAR project is an open source project aiming to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project provides large quantities of unannotated raw data that is commonly used in the pre-training of large deep learning models. It has developed high-performance data pipelines specifically conceived to classify and filter large amounts of web data. The project has paid special attention to improving the data quality of web-based corpora as well as to providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.

It originally started as a Common Crawl-based textual corpus developed to train contextualised word embeddings in as many languages as possible. OSCAR is maintained by its originators Pedro Ortiz (DFKI, Germany) and Benoît Sagot (Inria, France) and their colleagues, and many of the project’s users have also become active contributors. OSCAR has sparked numerous collaborations with research institutions and funded research projects in Germany, the rest of Europe and abroad, involving partners such as DFKI, Inria, Mannheim University, LMU, Common Crawl, Huma-Num and Hugging Face, among others.

The OSCAR project has now released three textual corpora (around 15 TB of raw text combined) as well as two high-performance data pipelines for reproducing them. The project has also released highly optimised tooling that allows users to efficiently extract and explore the data in the OSCAR corpora.
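To give a flavour of what classifying and filtering web data involves, here is a toy Python sketch of a quality filter of the kind such pipelines apply. All heuristics, thresholds and example documents below are invented for illustration; they are not OSCAR’s actual rules or code.

```python
# Toy sketch of web-corpus filtering: drop documents that do not look
# like running text. Heuristics and thresholds are illustrative only.

def alpha_ratio(text: str) -> float:
    """Share of alphabetic characters; markup- or number-heavy pages score low."""
    return sum(c.isalpha() for c in text) / len(text) if text else 0.0

def keep(doc: str, min_words: int = 5, min_alpha: float = 0.6) -> bool:
    """Keep a document only if it is long enough and mostly letters."""
    return len(doc.split()) >= min_words and alpha_ratio(doc) >= min_alpha

crawled = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "404 | 12:33 | >>> $$$ 0x1F",   # markup and number noise
    "Accept cookies",               # boilerplate, too short
]
corpus = [doc for doc in crawled if keep(doc)]
print(len(corpus))  # 1
```

A real pipeline would add language identification and deduplication on top of such per-document checks, but the basic shape, a stream of crawled documents reduced by cheap filters, is the same.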

The corpora produced by the OSCAR project have been used in other successful ML/AI initiatives such as OpenGPT-X, BLOOM and ROOTS (BigScience), CamemBERT and German BERT, among others, covering more than 30 different languages. The quality of the OSCAR corpora has been automatically evaluated multiple times and has also been the subject of an extensive human evaluation.

Further details at:
Selected new tools and resources on the European Language Grid
ModelFront is an API that predicts whether a given machine translation output is good or bad. It learns from post-editing data to reflect domain, terminology and style, even for specific brands, products or quality tiers. It can be integrated into any Translation Management System (TMS) and used with any Machine Translation (MT) engine. For hybrid translation, the TMS first calls the MT API for new segments, then calls the ModelFront API to get a quality prediction for each new segment. The TMS then treats the good MT segments like 100% TM matches: they can be “translated”, “approved”, “confirmed” or locked, and are included only for context. Only the remaining bad MT segments receive full human post-editing.
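The hybrid workflow described above boils down to a simple routing loop: translate each segment, score it, and send it to one of two buckets. The sketch below illustrates that logic with invented stub functions; these stand-ins are not ModelFront’s or any TMS’s actual API.

```python
# Sketch of hybrid translation routing: high-scoring MT segments are
# treated like 100% TM matches, the rest go to human post-editing.
# Both helper functions are invented stubs for illustration.

def mt_translate(segment: str) -> str:
    """Stub standing in for the MT engine call."""
    return f"[MT] {segment}"

def predict_quality(source: str, translation: str) -> float:
    """Stub standing in for the quality-prediction call (score in [0, 1])."""
    return 0.9 if len(source.split()) < 8 else 0.4  # invented heuristic

def route(segments, threshold=0.7):
    """Split segments into auto-approved and post-editing buckets."""
    auto_approved, post_editing = [], []
    for src in segments:
        hyp = mt_translate(src)
        bucket = auto_approved if predict_quality(src, hyp) >= threshold else post_editing
        bucket.append((src, hyp))
    return auto_approved, post_editing

good, bad = route(["Save changes",
                   "This long sentence would likely need careful human review today"])
print(len(good), len(bad))  # 1 1
```

In production the threshold would be tuned per quality tier, which is exactly the kind of domain- and style-specific calibration the post-editing data is used for.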
General news
After last month’s announcement of the open call results and a first project introduction, we want to introduce two more selected FSTP projects.

Tilburg University’s NGT-Dutch Hotel Review Corpus is a project focusing on sign language technology. It aims to create a parallel corpus of Dutch and Sign Language of the Netherlands (NGT) in order to support the research community and provide the still very limited and underrepresented field of sign language technology with high-quality data. The data will consist of hotel reviews translated from Dutch into NGT videos by deaf translators, forming a dataset for more robust sign language recognition. Working with deaf translators ensures the authenticity of the signed NGT and reduces the influence of the source language as much as possible, while limiting the data to one specific topic helps account for the variability of constructions and inter-/intra-signer variation. The target is a parallel corpus of Dutch text and NGT videos comprising roughly 200 hotel reviews. The corpus will be made publicly available through the ELG and the CLARIN platform, thus contributing to the development of more inclusive language technology.

Balidea’s project, titled Building E2E spoken-language understanding systems for virtual assistants in low-resources scenarios, aims to provide guidance for designing and collecting datasets for end-to-end (E2E) spoken-language understanding (SLU) systems. It will offer guidance on how to design SLU datasets in low-resource scenarios, establishing the required characteristics and proposing design quality measures, as well as on how to design data collection campaigns with a focus on target users. It will then present lessons learned from the campaigns, together with methodologies for validating the collected datasets and assessing their quality.

To achieve these goals, the project plans a study on the minimum design features of an SLU dataset for low-resource scenarios with a high degree of linguistic variety. It will propose quality measures, regardless of the language of application, to determine the complexity of the designed dataset and thereby establish minimum requirements for the design and collection of data. The project can contribute to the success of the strategic agenda by establishing guidelines on how to approach an E2E SLU project in a low-resource scenario.
From the SRIA
Policy Recommendations

After last month’s look at the SRIA Implementation Recommendations, we’re dedicating this month’s section to introducing some of the Policy Recommendations.
In order to achieve the overall goals of digital language equality and deep natural language understanding, and to reinforce European leadership in Language Technology, the ELE programme should be established as a large-scale, long-term coordinated funding programme. Europe’s more than 60 regional and minority languages should receive comprehensive EU-level legal protection, and the rights of national and linguistic minorities in the digital world should be officially recognised. Furthermore, mother-tongue teaching should be encouraged for speakers of official and non-official EU languages.

Funding should be safeguarded to meet the computational power and data access needs of the new technological approaches. Current funding schemes like Horizon Europe and Digital Europe should be further enhanced with specific programmes boosting long-term research and transfer of technology and knowledge between countries and regions, as well as between industry and academia.

You can read more about all SRIA recommendations here or take a look at the full document.
If you would like to voice your support for the ELE Programme and its goal and vision to achieve digital language equality in Europe by 2030, please consider filling out the endorsement form by clicking the button below and become a listed supporter on the ELE website:
Click here to endorse the ELE SRIA
Next edition

The next ELT newsletter will be sent out on 7 March 2023. Until then, follow our ELT social media accounts (as linked below) for the latest news!

Want to learn more? Visit 
or contact us directly.
Copyright © 2022 ELE and ELG Consortium, All rights reserved.
Why did I get this email?
The European Language Grid is an initiative funded by the European Union’s Horizon 2020 programme under grant agreement № 825627 (ELG).
The European Language Equality Project has received funding from the European Union under the grant agreement № LC-01641480 – 101018166 (ELE).
Want to change how you receive these emails?
You can update your preferences or unsubscribe from this list.