Newsletter #20 – October 2022

Dear reader,
 

This month, the ELG section is taken over by the German project and initiative OpenGPT-X, which develops large-scale language models for Europe, concentrating on the German language.

The International Day of Sign Languages, celebrated on 23 September as part of the International Week of Deaf People, further highlighted the importance of Europe’s linguistic diversity and overall multilingual setup.

This month’s featured resource is a corpus consisting of 5,485 texts written by students in Slovenian, partially annotated by teachers.

Finally, the European Language Equality initiative opened a call for SRIA Contribution Projects last week. We look forward to the proposals submitted to this call; the submission deadline is 29 November 2022.
 

With best regards
 

Georg Rehm

Language Technology and NLP in the news
Social media highlights
  • Multilingual workplaces sometimes lead to misunderstandings, but maybe this particular one should make its way into common usage.

  • A realist’s perspective on data wrangling.

  • Zoom’s live transcription feature is clearly a little out of its depth in a multilingual classroom, with some almost poetic results.

  • Language models are making headlines (see the ELG section below), but where did the term actually come from? Delip Rao investigates.

  • Impressive speech recognition results with OpenAI's Whisper.

General news

The diversity of Europe’s languages and the need to ensure they are all supported technologically is not only a driver of ELG, but also of one of the projects ELG/ELE collaborate with: OpenGPT-X. OpenGPT-X is a collaboration of business, science and technology to develop large-scale language models for Europe, combining the power of AI-based language technology with the needs and values of a specifically European market. These include data protection and privacy, multilingualism, gender equality, and (as the name suggests) open sharing of services and data.

This development holds great potential for European research and industry (including small and medium enterprises), given the power of current language models, which can recognize, produce, translate and process language in a manner often hard to distinguish from humans. Using the emerging Gaia-X data infrastructure, businesses will be able to use and share data and services free of charge, in multiple languages and according to the highest European data protection standards, to develop products and processes with a wide variety of language features (e.g., chatbots, digital assistants and personalised media reports). This will ensure Europe remains competitive with, rather than dependent on, language models from the US and China such as GPT-3 and Wu Dao 2.0. OpenGPT-X is funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) from January 2022 to December 2024 as part of the funding program Innovative and Practical Applications and Data Spaces in the Gaia-X Digital Ecosystem.

September 23 saw celebrations of the International Day of Sign Languages as part of the wider International Week of Deaf People. Europe’s linguistic diversity of course includes the multitude of signed languages used across the continent, and ELG hosts over 100 sign language resources and tools. The 21 languages covered include a range of European sign languages, as well as many from further afield. Among these tools and resources are the two ELG-funded pilot projects working on signed languages: a 3D motion capture dataset of Czech Sign Language (CSE), and a tool integrating Austrian Sign Language (ÖGS) and German Sign Language (DGS) definitions into online written texts.
 

Selected new tools and resources on the European Language Grid

Developmental corpus Šolar 3.0 - The Developmental corpus Šolar consists of 5,485 texts written by students in Slovenian secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (age 13-15), with a small percentage also from the 6th grade. Information on school (elementary or secondary), subject, level (grade or year), type of text, region, and date of production is provided for each text. School essays form the majority of the corpus, while other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications, etc. Part of the corpus (2,094 texts) is annotated with teachers' corrections using a system of labels described in the attached document (in Slovenian). Teacher corrections were part of the original files and reflect real classroom situations of essay marking. Corrections were then inserted into texts by annotators and subsequently categorized. The previous version 2.0 was also available in two separate versions: Šolar Clear 2.0 (http://hdl.handle.net/11356/1219), containing the students' texts without teacher corrections, and Šolar Error (http://hdl.handle.net/11356/1231), containing only those sentences that have teacher corrections. In contrast to that version, the current version uses a different encoding, its error annotations were manually edited in approximately 350 texts, and the linguistic annotation was performed with a new tool.

General news
The European Language Equality initiative opened a call for SRIA Contribution Projects. Research organisations, companies legally established in the EU, NGOs and incorporated associations are eligible for funding. Projects should last 2-3 months with a budget of up to €25,000. Become a part of the European Language Equality initiative, register, and submit your project proposals by 29 November 2022. Visit the ELE website and be sure to have a look at the linked documentation for more in-depth information.

The International Day of Sign Languages provided a great reminder that strengthening sign languages, with due regard for their multimodal nature, is an important part of European Language Equality.
 
Upcoming Events
If you have an event that you think the European language technology community should know about, get in touch with us to have it featured in this newsletter.
Next edition

The next ELT newsletter will be sent out on 1 November 2022. Until then, follow our ELT social media accounts (as linked below) for the latest news! 


 

Want to learn more? Visit https://european-language-technology.eu 
or contact us directly.
Copyright © 2022 ELE and ELG Consortium, All rights reserved.
The European Language Grid is an initiative funded by the European Union’s Horizon 2020 programme under grant agreement № 825627 (ELG).
The European Language Equality Project has received funding from the European Union under the grant agreement № LC-01641480 – 101018166 (ELE).