Newsletter #25 – March 2023

Dear reader,

In addition to the ubiquitous news around ChatGPT, which currently dominates the international Language Technology, AI and NLP landscape, we have plenty of other exciting news to share!

First, as already mentioned in our special newsletter issue, META-FORUM 2023 has officially been announced. The conference will take place on 27 June in Brussels. The online registration is already open so please save the date and register your attendance if you want to participate in the final ELE conference.

In this newsletter we’re introducing two new tools: an OpenGPT-X text completion model for the German language and an entity detection engine available in 29 different languages.

We’re also taking a look at two more selected FSTP projects: European LT Domains 2023 (EuLTDom2023) by the University of Zagreb and a project by Pangeanic that focuses on the Generation of a large speech corpus for Spain languages using Data Augmentation.

In the section “From the SRIA”, we present the Machine Translation Recommendations included in the Strategic Research, Innovation and Implementation Agenda and Roadmap.

With best regards

Georg Rehm

Language Technology and NLP in the news

“Call for Endorsement: ELE’s Strategic Agenda for Digital Language Equality in Europe” – Slator, 1 March 2023
“ChatGPT: five priorities for research” – Nature, 3 February 2023
“The profound danger of conversational AI” – VentureBeat, 4 February 2023
“Coming AI regulation may not protect us from dangerous AI” – VentureBeat, 4 February 2023
“English still number one EU language despite lack of native speakers” – The Brussels Times, 9 February 2023
“ChatGPT threatens language diversity. More needs to be done to protect our differences in the age of AI” – The Conversation, 9 February 2023
“Aleph Alpha believes Europe can compete with OpenAI if it ‘picks its battles’” – Sifted, 9 February 2023
“ChatGPT Is a Blurry JPEG of the Web” – The New Yorker, 9 February 2023
“The Chat GPT Vs Developers.” – TechTalk, 13 February 2023
“Insights from an AI author: The geopolitical consequences of ChatGPT” – European Council on Foreign Relations, 15 February 2023
“You.com challenges Google, Microsoft with launch of ‘multimodal conversational AI’ in search” – VentureBeat, 15 February 2023
“Sacré bleu! EU job ads can’t limit 2nd languages to English, German and French, court rules” – Politico, 16 February 2023
“‘I Will Not Stop at Just Four Languages’ — EU Celebrates Young Translators Award” – Slator, 17 February 2023
“Introducing the AI Mirror Test, which very smart people keep failing” – The Verge, 17 February 2023
“What ChatGPT means for linguistic diversity and language learning” – University World News, 24 February 2023

Social media highlights

Adventures in ChatGPT ASCII art.
How would one go about applying the Dunning-Kruger effect to an understanding of ChatGPT?
One of the many endangered European languages, Breton went from a million speakers in the 1950s to fewer than 200,000 today.
New tools for grammar checking, sample definitions, and more in lexicography & terminology work, powered by OpenAI GPT.
A handy LLM Cheat Sheet featuring key terms and concepts all in one place.
John Oliver’s Last Week Tonight segment on Artificial Intelligence.

General News

Tools and services can be uploaded to the ELG platform, which provides a convenient environment for making them available for testing. In addition to our monthly highlight section, we want to introduce one such tool with ELG integration: a German-language OpenGPT-X completion model for web pages, blog posts and articles. The ELG service provides a user interface for interacting with the language model. The model is a causal language model, that is, it is trained to predict the next token based on its input. It was trained using the multilingual BLOOM model and the CLP-Transfer method on approximately 50 billion German tokens. It works best when given input sentences from a web page similar to the content that should be generated. It features two modes: a “Sample mode” for more creative, less accurate completions, and a “Greedy mode” that is more accurate but may be more repetitive in its output.
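The difference between the two modes can be illustrated with a minimal Python sketch. It is a toy illustration of greedy versus sampled next-token selection in general, not the OpenGPT-X model or the ELG service itself; the vocabulary, the scores and the function names are all made up for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always choose the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng, temperature=1.0):
    """Sampling: draw a token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical vocabulary and next-token scores (assumptions for illustration).
vocab = ["Haus", "Auto", "Wetter", "Buch"]
logits = [2.0, 1.0, 0.5, 0.1]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_token = vocab[greedy_pick(logits)]
sampled_tokens = [vocab[sample_pick(logits, rng)] for _ in range(10)]

print("greedy :", greedy_token)
print("sampled:", sampled_tokens)
```

Greedy selection returns the same token every time for the same input, which is why it tends to be more repetitive, while sampling can return lower-scoring tokens and therefore varies from run to run.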

Selected new tools and resources on the European Language Grid

HENSOLDT ANALYTICS Named Entity Detection – HENSOLDT ANALYTICS MediaMiningIndexer NED is a named entity detection engine. The service takes a text (file) input and returns annotations of the words with their class. It classifies named entities of the following types: Person, Location, and Organization, with more types still in development.

In addition to English, the tool is also available for 28 other languages.

General news

In case you missed our special newsletter issue: our next conference – META-FORUM 2023 – will take place on 27 June in Brussels, Belgium. We will present the final results of the European Language Equality project and discuss all kinds of topics touching upon language technologies, language resources, language-centric AI and especially digital language equality. You can register for free here.

Slator published an article about our ELE SRIA endorsement call, highlighting the importance of language equality in the European Union as well as the ELE’s goals of reducing the technology gap between English and the other European languages.

We also want to introduce another two selected FSTP projects:

European LT Domains 2023 (EuLTDom2023) is a project by the University of Zagreb, Faculty of Humanities and Social Sciences. Language Technologies are used and developed to different extents across application domains. Some fields are well supported in terms of funding and personnel, while others remain underrepresented. The project aims to collect and analyse publications that use LTs in various domains, identify currently underrepresented domains and present a snapshot of the current distribution of LT among domains.

Pangeanic presents a project for the Generation of a large speech corpus for Spain languages using Data Augmentation. Developing robust speech technology requires large speech corpora containing different kinds of voices, set-ups and background noises. Pangeanic will build a diverse speech dataset with transcripts for Euskera, Asturian, Catalan, Galician and Aranese, extend it through data augmentation, and write guidelines for building such a dataset along the way. The data augmentation process entails adding variations and different noise types to the voice recordings in order to simulate obstructive elements that may impede accurate processing of voice data. The project comprises a planning phase, recording sessions, data augmentation and, lastly, generating scripts for data analysis and creating a guideline detailing the steps of the project, including the lessons learned. The goal is to create a dataset and guidelines that can be used for future speech data collection.
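As a rough illustration of what adding noise to a recording can look like, here is a minimal Python sketch that mixes random noise into a signal at a chosen signal-to-noise ratio (SNR). It is not Pangeanic's pipeline: the synthetic sine tone stands in for a real voice recording, and the function names, noise type and SNR values are assumptions chosen for this example.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_noise(signal, snr_db, rng):
    """Return a copy of `signal` with Gaussian noise mixed in at `snr_db` dB."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR in dB is 20 * log10(signal_rms / noise_rms); solve for the noise level.
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(signal, noise)]


# Stand-in for a clean recording: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate)
         for t in range(sample_rate)]

rng = random.Random(42)  # fixed seed so the augmented copies are repeatable
# One augmented copy per noise level; a lower SNR means a noisier recording.
augmented = {snr: add_noise(clean, snr, rng) for snr in (20, 10, 5)}

for snr, noisy in augmented.items():
    print(f"SNR {snr:>2} dB -> {len(noisy)} samples")
```

Each clean recording yields several noisy variants, so the dataset grows without additional recording sessions. In practice the noise would more likely come from recorded background sounds than from a random generator.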

Lastly, if you have a few minutes to spare, please consider filling out this Survey about Computational Facilities for Language Technology. The responses will be collected within the European Language Equality 2 project (https://european-language-equality.eu/) and will result in a snapshot of the current situation and relevant recommendations for HPC use in NLP/LT.

From the SRIA

Research Topic: Machine Translation Recommendations

Machine translation can be a helpful tool for overcoming language barriers within a multilingual Europe. Beyond plain translation, we envision translation systems that take cultural and context-sensitive aspects into account. Multimodal MT systems could serve as a connection between external and internal information. Multimodality will also facilitate the inclusion of sign languages and more direct approaches to translating speech from one language to another.

To measure the quality of current MT systems, new evaluation metrics need to be developed and validated. These future metrics should better reflect human judgements, should ideally not depend on human reference translations, and should be constructed flexibly so that they can be adapted to different contexts.

To develop fair, multilingual, multimodal, context-sensitive MT systems in the future, we recommend more research focussing on those aspects for European languages. To include all European languages, efforts for low-resource MT need to be strengthened and further financially supported.

You can read more about all SRIA recommendations here or take a look at the full document.

If you would like to voice your support for the ELE Programme and its goal and vision of achieving digital language equality in Europe by 2030, please consider filling out the endorsement form by clicking the button below to become a listed supporter on the ELE website:

Click here to endorse the ELE SRIA

Upcoming Events

Data Spaces Symposium, 21-23 March, The Hague, Netherlands (with participation of the new Language Data Space project)
Workshop on Profiling Second Language Vocabulary and Grammar, 20-21 April, Gothenburg, Sweden
10th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, 21-23 April, Poznań, Poland
3rd International Conference ‘Language in the Human-Machine Era’ (LITHME), 15-16 May, Groningen, Netherlands
META-FORUM 2023, 27 June, Brussels, Belgium

If you have an event that you think the European language technology community should know about, get in touch with us to have it featured in this newsletter.

Next edition

The next ELT newsletter will be sent out on 4 April 2023. Until then, follow our ELT social media accounts (as linked below) for the latest news!

Want to learn more? Visit https://european-language-technology.eu
or contact us directly.

The European Language Grid is an initiative funded by the European Union’s Horizon 2020 programme under grant agreement № 825627 (ELG).

The European Language Equality Project has received funding from the European Union under the grant agreement № LC-01641480 – 101018166 (ELE).
