Newsletter #20 – October 2022

Dear reader,

This month, the ELG section is taken over by the German project and initiative OpenGPT-X, which develops large-scale language models for Europe, concentrating on the German language.

The International Day of Sign Languages, celebrated on 23 September as part of the International Week of Deaf People, further highlighted the importance of Europe’s linguistic diversity and overall multilingual setup.

This month’s featured resource is a corpus consisting of 5,485 texts written by students in Slovenian, partially annotated by teachers.

Finally, the European Language Equality initiative opened a call for SRIA Contribution Projects last week. We look forward to the proposals submitted to this call; the submission deadline is 29 November 2022.

With best regards

Georg Rehm

Language Technology and NLP in the news

“Artificial intelligence is helping scientists decode animal languages” – Popular Science, 1 September 2022
“Why large AI language models don’t lead to human-like AI” – The Decoder, 4 September 2022
“The EU’s AI Act could have a chilling effect on open source efforts, experts warn” – TechCrunch, 6 September 2022
“In conversation with AI: building better language models” – DeepMind, 6 September 2022
“BigScience’s OpenRAILs Can Promote Responsible Use of AI” – Analytics India Magazine, 7 September 2022
“AI researchers improve method for removing gender bias in natural language processing” – University of Alberta, 8 September 2022
“AI Can Independently Learn And Recognize Language Norms And Patterns” – Dataconomy, 8 September 2022
“An AI can decode speech from brain activity with surprising accuracy” – Science News, 8 September 2022
“Amazon Unveils New AI Language Model that Beats GPT-3” – Analytics India Magazine, 8 September 2022
“Estonia crowdsources speech data for the preservation of the Estonian language” – GovInsider, 13 September 2022
“Of God And Machines” – The Atlantic, 15 September 2022
“So you want to be a prompt engineer: Critical careers of the future” – VentureBeat, 17 September 2022
“Theories of AI liability: It's still about the human element” – Reuters, 20 September 2022
“EU AI Act should ‘exclude general purpose artificial intelligence’ – industry groups” – Tech Monitor, 27 September 2022
“AI research looks to bridge gaps between signed and spoken languages” – Silicon Republic, 28 September 2022

Social media highlights

Multilingual workplaces sometimes lead to misunderstandings, but maybe this particular one should make its way into common usage.
A realist’s perspective on data wrangling.
Zoom’s live transcription feature is clearly a little out of its depth in a multilingual classroom, with some almost poetic results.
Language models are making headlines (see the ELG section below), but where did the term actually come from? Delip Rao investigates.
Impressive speech recognition results with OpenAI's Whisper.

General news

The diversity of Europe’s languages and the need to ensure they are all supported technologically drive not only ELG, but also one of the projects ELG and ELE collaborate with: OpenGPT-X. OpenGPT-X is a collaboration of business, science and technology to develop large-scale language models for Europe, combining the power of AI-based language technology with the needs and values of a specifically European market. These include data protection and privacy, multilingualism, gender equality, and (as the name suggests) open sharing of services and data.

This development holds great potential for European research and industry (including small and medium enterprises), given the power of current language models, which can recognize, produce, translate and process language in a manner often hard to distinguish from that of humans. Using the emerging Gaia-X data infrastructure, businesses will be able to use and share data and services free of charge, in multiple languages and according to the highest European data protection standards, to develop products and processes with a wide variety of language features (e.g., chatbots, digital assistants and personalised media reports). This will ensure Europe remains competitive with, rather than dependent on, language models from the US and China such as GPT-3 and Wu Dao 2.0. OpenGPT-X is funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) from January 2022 to December 2024 as part of the funding program Innovative and Practical Applications and Data Spaces in the Gaia-X Digital Ecosystem.

September 23 saw celebrations of the International Day of Sign Languages as part of the wider International Week of Deaf People. Europe’s linguistic diversity of course includes the multitude of signed languages used across the continent, and ELG hosts over 100 sign language resources and tools. The 21 languages covered include a range of European sign languages, as well as many from further afield. Amongst these 100 tools and resources are the two ELG-funded pilot projects working on signed languages: a 3D motion capture dataset of Czech Sign Language (CSE), and a tool integrating Austrian Sign Language (ÖGS) and German Sign Language (DGS) definitions into online written texts.

Selected new tools and resources on the European Language Grid

Developmental corpus Šolar 3.0 - The Developmental corpus Šolar consists of 5,485 texts written by students in Slovenian secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (age 13-15), with a small percentage also from the 6th grade. Information on school (elementary or secondary), subject, level (grade or year), type of text, region, and date of production is provided for each text. School essays form the majority of the corpus, while other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications, etc. Part of the corpus (2,094 texts) is annotated with teachers' corrections using a system of labels described in the attached document (in Slovenian). Teacher corrections were part of the original files and reflect real classroom situations of essay marking. Corrections were then inserted into the texts by annotators and subsequently categorized. The previous version 2.0 was also available in two separate versions: Šolar Clear 2.0 (http://hdl.handle.net/11356/1219), containing the students' texts without teacher corrections, and Šolar Error (http://hdl.handle.net/11356/1231), containing only those sentences that have teacher corrections. Compared with version 2.0, the current version uses a different encoding, error annotations were manually edited in approx. 350 texts, and the linguistic annotation was performed with a new tool.

General news

The European Language Equality initiative opened a call for SRIA Contribution Projects. Research organisations, companies legally established in the EU, NGOs and incorporated associations are eligible for funding. Projects should last 2-3 months with a budget of up to €25,000. Become a part of the European Language Equality initiative, register, and submit your project proposals by 29 November 2022. Visit the ELE website and be sure to have a look at the linked documentation for more in-depth information.

The International Day of Sign Languages provided a great reminder that the strengthening of sign languages is an important part of European Language Equality, also taking into account their multimodal nature.

Upcoming Events

October 6-7: Human Language Technologies - the Baltic Perspective https://hlt2022.tilde.eu/conference
October 10-12: CLARIN Annual Conference (Prague) https://www.clarin.eu/content/clarin-annual-conference
October 11-13: 19th Annual conference of EFNIL (Vilnius)
October 20: Complexity in Language Variation and Change (Palma de Mallorca)
October 26: Irish Language Technology workshop (Dublin) https://www.eventbrite.ie/e/ceardlann-teicneolaiochta-gaeilgeirish-language-technology-workshop-tickets-409828326557
November 9-11: Translating Europe Forum (Brussels/online) https://ec.europa.eu/info/events/2022TEF_en
November 30 - December 2: 28th LIPP Symposium: Sprache in der digitalen Welt – Vermittlung, Variation, Politik (Munich)
December 1-3: Digital Research Data and Human Sciences – Diversity of Methods and Materials (Jyväskylä)
April 21-23, 2023: 10th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (Poznań)

If you have an event that you think the European language technology community should know about, get in touch with us to have it featured in this newsletter.

Next edition

The next ELT newsletter will be sent out on 1 November 2022. Until then, follow our ELT social media accounts (as linked below) for the latest news!

Want to learn more? Visit https://european-language-technology.eu or contact us directly.

The European Language Grid is an initiative funded by the European Union’s Horizon 2020 programme under grant agreement № 825627 (ELG).

The European Language Equality project has received funding from the European Union under grant agreement № LC-01641480 – 101018166 (ELE).
