Newsletter #25 – March 2023
Dear reader,

In addition to the ubiquitous news about ChatGPT currently dominating the international Language Technology, AI and NLP landscape, we have a variety of other exciting news to share!

First, as already mentioned in our special newsletter issue, META-FORUM 2023 has officially been announced. The conference will take place on 27 June in Brussels. The online registration is already open so please save the date and register your attendance if you want to participate in the final ELE conference.

In this newsletter we’re introducing two new tools: an OpenGPT-X text completion model for the German language and a named entity detection engine available in 29 languages.

We’re also taking a look at two more selected FSTP projects: European LT Domains 2023 (EuLTDom2023) by the University of Zagreb and a project by Pangeanic that focuses on the Generation of a large speech corpus for Spain languages using Data Augmentation.

In the section “From the SRIA”, we present the Machine Translation Recommendations included in the Strategic Research, Innovation and Implementation Agenda and Roadmap.

With best regards

Georg Rehm
 
Language Technology and NLP in the news
Social media highlights
General News
The ELG platform lets providers upload tools and services so that they can conveniently be made available for testing. In addition to our monthly highlight section, we want to introduce one such tool with ELG integration: a German-language OpenGPT-X completion model for web page, blog and article text. The ELG service provides a user interface for interacting with the language model. The model is a causal language model trained to predict the next token based on its input. It was trained using the multilingual BLOOM model and the CLP-Transfer method on approximately 50 billion German tokens. It works best when given input sentences from a web page similar to the content that should be generated. It features two modes: a “Sample mode” for more creative, less accurate completions, and a “Greedy mode” that is more accurate but may produce more repetitive output.
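For readers curious about the difference between the two modes, here is a minimal sketch of greedy versus sampled next-token selection. It is a toy illustration only and not the OpenGPT-X implementation: the candidate tokens, their scores, the temperature parameter and the function names are all hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(tokens, logits):
    """Greedy mode: always take the highest-scoring token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]


def pick_sample(tokens, logits, temperature=1.0, rng=random):
    """Sample mode: draw a token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates for the next token, with made-up scores.
TOKENS = ["Berlin", "München", "Hamburg"]
LOGITS = [2.0, 1.0, 0.5]

if __name__ == "__main__":
    print("greedy:", pick_greedy(TOKENS, LOGITS))
    print("sample:", pick_sample(TOKENS, LOGITS, temperature=1.0))
```

Greedy selection returns the same token every time for the same input, which is why its output can become repetitive, while sampling can return any candidate and therefore varies between runs.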
 
Selected new tools and resources on the European Language Grid
HENSOLDT ANALYTICS Named Entity Detection – HENSOLDT ANALYTICS MediaMiningIndexer NED is a named entity detection engine. The service takes text (or a text file) as input and returns annotations of the words with their class. It classifies named entities of the following types: Person, Location, and Organization, with more types still in development.

In addition to English, the tool is also available for 28 other languages.
 
General news
In case you missed our special newsletter issue: our next conference – META-FORUM 2023 – will take place on 27 June in Brussels, Belgium. We will present the final results of the European Language Equality project and discuss all kinds of topics touching upon language technologies, language resources, language-centric AI and especially digital language equality. You can register for free here.

Slator published an article about our ELE SRIA endorsement call, highlighting the importance of language equality in the European Union as well as the ELE’s goals of reducing the technology gap between English and the other European languages.

We also want to introduce another two selected FSTP projects:

European LT Domains 2023 (EuLTDom2023) is a project by the University of Zagreb, Faculty of Humanities and Social Sciences. Language technologies are used and developed to different extents across application domains. Some fields are well supported, both financially and in terms of staffing, while others remain underrepresented. The project aims to collect and analyse publications that use LTs in various domains, identify currently underrepresented domains and present a snapshot of the current LT distribution among domains.

Pangeanic presents a project for the Generation of a large speech corpus for Spain languages using Data Augmentation. Developing robust speech technology requires large speech corpora containing different kinds of voices, set-ups and background noises. While building a diverse dataset for Euskera, Asturian, Catalan, Galician and Aranese, Pangeanic will create guidelines for building an extensive speech dataset and transcripts through data augmentation. The data augmentation process entails adding variations and different noise types to the voice recordings in order to simulate obstructive elements that may impede accurate processing of voice data. The project comprises a planning phase, recording sessions, data augmentation and, lastly, generating scripts for data analysis and writing a guideline that details the steps of the project, including the lessons learned along the way. The goal is to create a dataset and guidelines that can be used for future speech data collection.
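To give a rough idea of what noise-based augmentation can look like, here is a minimal sketch that mixes Gaussian white noise into a recording at a chosen signal-to-noise ratio. This is an assumption for illustration only: the project description does not say which noise types, SNR levels or tools Pangeanic uses, and the function name and parameters below are hypothetical.

```python
import math
import random


def add_noise(samples, snr_db, rng=None):
    """Return a copy of `samples` with white noise added at roughly `snr_db` dB.

    `samples` is a list of floats (for example audio amplitudes in [-1, 1]).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    rng = rng or random.Random()
    # Average power of the clean signal.
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR in dB is 10 * log10(signal_power / noise_power); solve for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


if __name__ == "__main__":
    # One second of a 440 Hz tone at 16 kHz as a stand-in for a voice recording.
    rate = 16000
    tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    noisy = add_noise(tone, snr_db=10, rng=random.Random(0))
    print(len(noisy), "samples written")
```

Running the same clean recording through several SNR values yields multiple training variants from a single recording session, which is the general idea behind augmentation.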

Lastly, if you have a few minutes to spare, please consider filling out this Survey about Computational Facilities for Language Technology. The survey collects data for the European Language Equality 2 project (https://european-language-equality.eu/) and will result in a snapshot of the current situation and relevant recommendations for HPC use in NLP/LT.
From the SRIA
Research Topic: Machine Translation Recommendations

Machine translation can be a helpful tool to overcome language barriers within a multilingual Europe. In addition to plain translation, we envision translation systems that can take cultural and context-sensitive aspects into account. Multimodal MT systems could serve as a connection between external and internal information. Multimodality will also facilitate the inclusion of sign languages and more direct approaches to translating speech from one language to another.

To measure the quality of current MT systems, new evaluation metrics need to be developed and evaluated. These future metrics should better reflect human judgements, should ideally not depend on human reference translations, and should be constructed flexibly so that they can be adapted to different contexts.

To develop fair, multilingual, multimodal, context-sensitive MT systems in the future, we recommend more research focussing on those aspects for European languages. To include all European languages, efforts for low-resource MT need to be strengthened and further financially supported.

You can read more about all SRIA recommendations here or take a look at the full document.
If you would like to voice your support for the ELE Programme and its goal and vision of achieving digital language equality in Europe by 2030, please consider filling out the endorsement form by clicking the button below and becoming a listed supporter on the ELE website:
Click here to endorse the ELE SRIA
Upcoming Events

If you have an event that you think the European language technology community should know about, get in touch with us to have it featured in this newsletter.
 

Next edition

The next ELT newsletter will be sent out on 4 April 2023. Until then, follow our ELT social media accounts (as linked below) for the latest news!


Want to learn more? Visit https://european-language-technology.eu 
or contact us directly.
Website
YouTube
Twitter
LinkedIn
Copyright © 2022 ELE and ELG Consortium, All rights reserved.
The European Language Grid is an initiative funded by the European Union’s Horizon 2020 programme under grant agreement № 825627 (ELG).
The European Language Equality Project has received funding from the European Union under the grant agreement № LC-01641480 – 101018166 (ELE)