Newsletter #31 – October 2023

Dear reader,

Another month of Language Technology news and updates is upon us! Read about the United Nations’ plans for a High-Level Advisory Body on Artificial Intelligence, the European Union’s concerns regarding the risks of deepfakes for future elections, the latest on the European AI Act, and much more.

Our social media highlights feature a useful NLP/LLM resource list as well as an interesting blog post about LLMs’ difficulties with maths.

Nature Machine Intelligence published a paper on using AI to improve the verifiability of Wikipedia articles by combining an information retrieval system with a language model.

Our ELG resource of the month is a dataset consisting of sentences from the parliamentary proceedings of Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom.

With best regards

Georg Rehm

The European Language Data Space (LDS) Newsletter

The European Language Data Space initiative has its own monthly newsletter with information on the latest developments in secure, privacy-preserving language data sharing and use across Europe.

This month’s edition features an interview with Philippe Gelin about the vision of LDS, which may be of interest to readers of this newsletter.

We’d like to invite you to subscribe to the newsletter for updates on LDS implementation, success stories, events, and more!

Language Technology and NLP in the news

“UK’s new AI principles target ‘pro-innovation’ edge over the EU” – TNW, 19 September 2023
“How the U.N. Plans to Shape the Future of AI” – Time, 21 September 2023
“Poland investigates OpenAI over privacy concerns” – Reuters, 21 September 2023
“EU Defers Decision on Making Basque, Catalan, Galician Official EU Languages” – Slator, 26 September 2023
“As European doctors retire, could AI help to solve a health workforce shortage?” – Euronews Next, 26 September 2023
“Can Large Language Models Do Simultaneous Machine Translation?” – Slator, 26 September 2023
“Deepfake election risks trigger EU call for more generative AI safeguards” – TechCrunch, 26 September 2023
“European Central Bank assembles ‘infinity team’ to identify GenAI applications” – TNW, 28 September 2023
“Europe to rival U.S. and China with 'Responsible AI'?” – Hamburg News, 5 October 2023
“Translation Has A Viral Moment” – Slator, 6 October 2023
“U.S. Warns E.U.’s Landmark AI Policy Will Only Benefit Big Tech” – Time, 6 October 2023
“How Can AI Help UNESCO in Europe?” – Finance Magnates, 9 October 2023
“Scientific experimentation with generative AI” – CEPR, 16 October 2023
“How Prompting by Humans Improves Machine Translation” – Slator, 16 October 2023
“AI Act: EU countries headed to tiered approach on foundation models amid broader compromise” – Euractiv, 17 October 2023
“EU Elections at Risk with Rise of AI-Enabled Information Manipulation” – ENISA, 19 October 2023
“The AI Act’s crunch time, submarine cables financing” – Euractiv, 20 October 2023

Social media highlights

Check out this extensive, continuously updated GitHub list of NLP and LLM resources, datasets, tutorials, and more.
Does it matter if language teachers are native or non-native speakers? Read about it in this blog post by Sanako.
Watch Jakub Absolon, CEO of ASAP-translation.com, discuss the role of post-edited machine translation and why he thinks full post-editing is a misnomer on the 20 October episode of SlatorPod.
Read this blog post by Gary Martin explaining why maths proves to be so difficult for LLMs.
September 23 was Multilingualism Day 2023!
Automatic translation from LLMs, albeit handy, is still flawed and comes up with some pretty funny translations sometimes.
ChatGPT’s Halloween Costume ideas might be a little questionable. It’s giving “my knowledge was last updated in September 2021”.

Publications

Nature Machine Intelligence published a paper on the use of AI to improve the verifiability of Wikipedia articles. Fabio Petroni and his co-authors present a novel AI system named SIDE (System for Improving Document Evidence) that aims to enhance the verifiability of claims made on Wikipedia. The authors address the challenge of maintaining and improving the quality of Wikipedia references and highlight the need for improved tools to assist editors in the process.

The SIDE system is powered by a combination of an information retrieval system and a language model. It can identify Wikipedia citations that are likely insufficient to support their respective claims and subsequently suggest better citations from the web. The model is trained on existing Wikipedia references, leveraging the collective knowledge of thousands of Wikipedia editors. The paper demonstrates that for the top 10% of claims identified as most likely unverifiable by SIDE, humans prefer the system's suggested alternatives over the originally cited references 70% of the time. Moreover, in a demonstration with the English-speaking Wikipedia community, SIDE's first citation recommendation is preferred twice as often as the existing Wikipedia citation for the same top 10% of claims that SIDE rates as most likely unverifiable.

The paper also outlines several avenues for future research, including extending the system to support multiple languages other than English.

Selected new tools and resources on the European Language Grid

The multilingual sentiment dataset of parliamentary debates ParlaSent 1.0 – This month’s resource is a dataset consisting of mid-length sentences from the parliamentary proceedings of Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom, annotated with a 6-level sentiment schema you can read more about on the resource’s ELG page.

The data coming from the parliaments of Bosnia and Herzegovina, Croatia and Serbia are organised as a single parliament group, named "BCS", due to the similarity of the official languages in these countries. For each of the six parliaments / parliament groups, 2,600 training instances were annotated by two annotators, with one additional conflict resolution step. While these training instances were sampled via sentiment lexicons to contain more sentiment-loaded sentences, two test sets were randomly sampled from selected parliaments, one from the BCS parliament group, another from the parliament of the United Kingdom. Each test set consists of 2,600 sentences, annotated by one highly trained annotator. Training datasets were internally split into "train", "dev" and "test" portions for performing language-specific experiments.

Upcoming Events

European Big Data Value Forum - Data and AI in Action: Sustainable Impact and Future Realities, 25 – 27 October, Valencia, Spain
NGI Forum 2023: Unlocking the power of Digital Commons, 15 – 16 November, Brussels, Belgium

If you have an event that you think the European language technology community should know about, get in touch with us to have it featured in this newsletter.

Next edition

The next ELT newsletter will be sent out on 28 November 2023. Until then, follow our ELT social media accounts (as linked below) for the latest news!

Want to learn more? Visit https://european-language-technology.eu
or contact us directly.

The European Language Grid is an initiative funded by the European Union’s Horizon 2020 programme under grant agreement № 825627 (ELG).

The European Language Equality Project has received funding from the European Union under grant agreement № LC-01641480 – 101018166 (ELE).
