Digital Language Equality in Europe by 2030:

Strategic Agenda and Roadmap

Recommendations

The basic requirement of the future ELE Programme is collaboration between the EU/EC and all participating countries and regions. Moreover, funding and further investment are needed at all levels. Funding at the EU level should enable overarching coordination and EU-wide technological infrastructure. It should cover the topics that require pan-European coordination, such as shared tasks, protocols, and multilingual dataset creation. Increased coordination at the European level is needed because language communities are still too fragmented and small. Further effort should be invested in establishing adequate policy-making, distributed research infrastructures and technological platforms like ELG, with flexible access to sufficient HPC facilities. Additionally, national and regional funding should complement the European funding with regard to language-specific research and development. The implementation of these aspects was described, among others, in the ELE language reports.

The following sections break down what concrete recommendations for such a shared programme should look like.

Recommendation Sections:

Policy
Governance Model
Technology and Data
Infrastructure
Research
Machine Translation
Speech Processing
Text Analytics and Natural Language Understanding
Implementation

Policy Recommendations

To reinforce European leadership in LT by establishing the ELE programme as a large-scale, long-term coordinated funding programme for research, development, innovation and education with the societal goal of digital language equality and the scientific goal of deep natural language understanding.
To ensure comprehensive EU-level legal protection for the more than 60 regional and minority languages.
To empower recognition of the collective rights of national and linguistic minorities in the digital world (including sign languages).
To encourage mother-tongue teaching for speakers of official and non-official languages of the EU.
To safeguard sufficient funding to support the new technological approaches, based on increased computational power and better access to sizeable amounts of data.
To develop specific programmes within current funding schemes, especially Horizon Europe and Digital Europe (including the Recovery Plan for Europe), to boost long-term basic research as well as knowledge and technology transfer between countries and regions, and between academia and industry.
To define and develop a BLARK-like minimum set of language resources and capacities that all European languages should possess.
To develop common policy actions and clear protocols for language data sharing by public administration at all levels. Language data should be included as a high-value data category in the Open Data Directive (2019/1024/EU).
To develop clear and robust protocols to ensure flexible access to sufficient GPU-based HPC infrastructure and to process sensitive data.
To enable and empower European SMEs and startups to easily access and use LT in order to grow their businesses online independent of language barriers.
To create the necessary appealing conditions to attract and retain qualified and diverse international LT personnel in Europe.
To ensure mechanisms to achieve European LT sovereignty.

Governance Model

To structure the ELE Programme as a shared, collaborative and coordinated programme between the EU and participating countries and regions.
To allocate the area of multilingualism, linguistic diversity and language technology to the portfolio of an EU Commissioner.
To spark a large lobby for EU regional and minority languages.
To create a pan-European network of research centres to facilitate the coordination of the ELE programme at all levels.
To promote a distributed centre for linguistic diversity that will strengthen awareness of the importance of lesser-used, regional and minority languages.
To design and apply new forms of research funding and organisation to ease the transition from application-oriented basic research to commercially focused technology.
To construct a multilingual LT benchmark, a European “SuperGLUE”-style shared benchmark, that tracks progress.
To strongly encourage all EC-funded projects to have a language diversity plan and to include direct or associated partners from a less widely spoken language.
To facilitate EU Member States’ acquisition of LT for their local industries without depending on non-European technology providers.

Technology and Data Recommendations

To develop high-performance applications (in terms of speed and quality) for all languages that respect safety, security and privacy.
To address the lack of available data and define the minimum of language resources and capacities that all European languages should possess.
To add more focus on systematic language data collection (text, dialogue, multimodal) and exploit automatic data generation (synthetic data), crowd-sourcing and translation of data.
To ensure efficient adaptations to applications in terms of language, domain, efficiency, power consumption, ease of maintenance, and quality assurance.
To develop methods to overcome the unequal data availability by focusing on, e.g., annotation transfer, multilingual models preserving quality, and few-shot or zero-shot learning.
To unleash the power of public sector data, data from broadcasters, social media, publishers etc.
To enforce open ecosystems, open source, open access, open standards and interoperability.
To focus on research in data bias for strengthening inclusiveness and accessibility.
To focus on green LT, i.e., technologies with a small compute and carbon footprint (e.g., through model compression).
To foster publicly available resources that facilitate innovation and research for both commercial and non-commercial actors.
To develop large open-source language models that work for all EU languages, optimised in terms of compute time and cost.
To develop new methodologies for transfer and adaptation of resources and technologies to other domains and languages.
To define the minimum language resources that all European languages should possess in order to prevent digital extinction.
To support the coordination between research and industry to enhance the digital possibilities for language translation and open access to the data required for technological advancement.
To encourage administrations at all levels to improve access to online services and information in different languages.

Infrastructure Recommendations

To strengthen existing and create new research infrastructures (RIs) and LT platforms that support research and development activities, including collaboration, knowledge sharing, and open access to data and technologies.
To ensure sufficient operational capacity, especially for large language models.
To fill the identified gaps in data, language resources, and knowledge graphs, and to create a future path for Europe towards comprehensive and interlinked data infrastructures.
The technology vision of an integrated and interoperable data infrastructure shall follow the idea of a Semantic Data Fabric including rich semantics, and thereby context and meaning as well as dynamic and augmented metadata and data management.
To ensure flexible access to GPU-based HPC facilities and a more suitable computing infrastructure.
To create a European network of centres of excellence in LT to increase industry visibility, design national research agendas and employ a European Data Strategy.

Overall Research Recommendations

To gather and make available the necessary critical mass of resources in terms of data, computing facilities, and expertise from pan-European LT research labs and centres, with the support from the EC as well as national and regional administrations.
To create sufficient multilingual and multi-modal data of quality (responsible, legal, diverse, unbiased, ethical, representative, etc.), in all European languages and domains (media, health, legal, education, etc.).
To provide flexible access to HPC facilities in the form of clusters of high capacity GPUs for LT research and industry. HPC facilities should provide clear and robust protocols to process sensitive data.
To develop better benchmarks and datasets (ethical, responsible, legal, etc.) for all languages, domains, tasks and modalities.
To combine interactive LT (conversational AI) with text, knowledge, and multimedia technologies for a new generation of applications that can address the deeper questions of communication, common sense and reasoning.
To encourage responsible, green, trustworthy, unbiased, inclusive, non-discriminatory LT/AI, making interpretability and explainability of AI models a priority.
To further develop the areas of Responsible AI and Explainable AI by combining statistical and symbolic AI in multilingual environments to provide AI-based applications that bring accurate results and benefits for research, industry, and society.
To focus on methods and learning architectures to overcome the highly unequal data availability, such as annotation transfer, synthetic data and their proper use in machine learning, multilingual models preserving quality and coverage, and few-shot or zero-shot learning.
To focus on Green LT and investigate new efficient methods to extend, reuse and adapt existing pre-trained language models or develop new ones with much reduced carbon footprint.
To develop language and culture-specific technologies that cover more linguistic phenomena and text types, focusing on accessibility, through sign language, avatar technology etc.

Machine Translation

To develop direct and near-real-time speech-to-speech MT and adaptive MT, where the system learns from linguists’ input.
To develop low-resource MT by deepening research on embedding projection and the structural organisation of embeddings, in order to understand how structurally different languages and their respective embedding spaces can be mapped onto one another.
To provide transparency of AI models with regard to accuracy and fairness.
To move towards context-aware methodologies that go beyond text data and include images, videos, tables, etc. by developing multimodal MT systems.
To reframe MT, and NLP in general, as a quantum computing problem.

Speech Processing

To enhance speech resources and create acoustic models to cover a wide variety of languages, including non-standard varieties and dialects.
To develop good, natural synthetic voices, allowing users to obtain content in their spoken languages.
To improve context modeling to handle the translation across larger volumes of text.
To improve the handling of audio conditions currently perceived as difficult (e.g., multiple simultaneous speakers in noisy environments speaking spontaneously and highly emotionally in a mix of languages).
To support research in the direction of combining speech, NLU and NLP with other modalities, such as image and vision.
To address privacy and security threats in areas of speech synthesis, voice cloning and speaker recognition.

Text Analytics and Natural Language Understanding

To increase the adoption of approaches based on self-supervised, zero-shot, and few-shot learning.
To support research in NLU which integrates speech, NLP, and contextual information as well as additional modes of perception.
To strengthen basic research in neurosymbolic approaches to NLP/NLU, including grounding and the use of human-understandable databases and sources.
To create large open-access language models for all European languages (for fine-tuning and downstream tasks), datasets (for training and testing), multilingual models, models that include symbolic knowledge, and models that include discourse features.
To strengthen progress in reinforcement-based learning, novel dialogue management strategies, and situation-aware natural language generation.
To strengthen interdisciplinary research and enable better modeling of multimodal environments.

Implementation Recommendations

To structure the nine-year ELE Programme into three phases of three years each.
To facilitate discussions between the EU/EC and participating countries to define needs and goals as well as the financial setup.
To encourage participating countries to invest in the development of LLMs, data sets, technologies and tools for their own languages.
To have the EU establish binding legislation to encourage or ensure participation.
To have the EU invest in pan-European coordination of all language-specific projects and initiatives, support mechanisms, infrastructures, data procedures, cross-cutting projects, etc., and provide flex funds for bootstrapping poorly supported languages.
To structure the ELE Programme into six themes: Language Modelling, Data and Knowledge, Machine Translation, Text Understanding, Speech, and Infrastructure. To support each theme by coordination actions (CSAs), research actions (RIAs) as well as actions for innovation and deployment (IAs).