Newsletter #31 – October 2023
Dear reader,

Another month of Language Technology news and updates is upon us! Read about the United Nations’ plans for a High-Level Advisory Body on Artificial Intelligence, the European Union’s concerns regarding the risks of deepfakes for future elections, the latest on the European AI Act, and much more.

Our social media highlights feature a useful NLP/LLM resource list as well as an interesting blog post about LLMs’ difficulties with maths.

Nature Machine Intelligence published a paper on using AI to improve the verifiability of Wikipedia articles by combining an information retrieval system with a language model.

Our ELG resource of the month is a dataset consisting of sentences from the parliamentary proceedings of Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom.


With best regards,


Georg Rehm
 
The European Language Data Space (LDS) Newsletter

The European Language Data Space initiative has its own monthly newsletter with information on the latest developments in secure, privacy-preserving language data sharing and use across Europe. 

This month’s edition features an interview with Philippe Gelin about the vision of LDS, which may interest readers of this newsletter.

We’d like to invite you to subscribe to the newsletter for updates on LDS implementation, success stories, events, and more!

Language Technology and NLP in the news
Social media highlights
Publications

Nature Machine Intelligence published a paper on the use of AI to improve the verifiability of Wikipedia articles. Fabio Petroni and his co-authors present a novel AI system named SIDE (System for Improving Document Evidence) that aims to enhance the verifiability of claims made on Wikipedia. The authors address the challenge of maintaining and improving the quality of Wikipedia references and highlight the need for improved tools to assist editors in the process. 

The SIDE system is powered by a combination of an information retrieval system and a language model. It can identify Wikipedia citations that are likely insufficient to support their respective claims and subsequently suggest better citations from the web. The model is trained on existing Wikipedia references, leveraging the collective knowledge of thousands of Wikipedia editors. The paper demonstrates that for the top 10% of claims identified as most likely unverifiable by SIDE, humans prefer the system's suggested alternatives over the originally cited references 70% of the time. Moreover, in a demonstration with the English-speaking Wikipedia community, SIDE's first citation recommendation for those same claims is preferred twice as often as the existing Wikipedia citation.

The paper also outlines several avenues for future research, including extending the system to support multiple languages other than English.

Selected new tools and resources on the European Language Grid

The multilingual sentiment dataset of parliamentary debates ParlaSent 1.0 – This month’s resource is a dataset consisting of mid-length sentences from the parliamentary proceedings of Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom, annotated with a 6-level sentiment schema you can read more about on the resource’s ELG page.

The data from the parliaments of Bosnia and Herzegovina, Croatia and Serbia are organised as a single parliament group, named "BCS", due to the similarity of the official languages in these countries. For each of the six parliaments / parliament groups, 2,600 training instances were annotated by two annotators, with one additional conflict resolution step. While these training instances were sampled via sentiment lexicons to contain more sentiment-loaded sentences, two test sets were randomly sampled from selected parliaments: one from the BCS parliament group and another from the parliament of the United Kingdom. Each test set consists of 2,600 sentences, annotated by one highly trained annotator. Training datasets were internally split into "train", "dev" and "test" portions for performing language-specific experiments.

Upcoming Events

If you have an event that you think the European language technology community should know about, get in touch with us to have it featured in this newsletter.

Next edition

The next ELT newsletter will be sent out on 28 November 2023. Until then, follow our ELT social media accounts (as linked below) for the latest news!


Want to learn more? Visit https://european-language-technology.eu 
or contact us directly.
Website
YouTube
Twitter
LinkedIn
Copyright © 2022 ELE and ELG Consortium, All rights reserved.
The European Language Grid is an initiative funded by the European Union’s Horizon 2020 programme under grant agreement № 825627 (ELG).
The European Language Equality Project has received funding from the European Union under the grant agreement № LC-01641480 – 101018166 (ELE).