Coordination of the integration of lexicographic databases into large language models to strengthen AI-driven linguistic infrastructure.
PORTULAN CLARIN LVT Consortium
Reference: LISBOA2030-FEDER-01316900
Funding body: European Regional Development Fund (ERDF); Portugal 2030 – Lisbon Regional Programme 2021–2027
Coordination: António Branco, Amália Mendes, Ana Salgado (ACL)
Participating institutions: ACL, Faculty of Sciences (António Branco, General Director), Faculty of Arts and Humanities of the University of Lisbon (Amália Mendes, Executive Director)
Position: Executive Director
Period: 14 April 2025 – 2027
Funding: €498,577.68
Website: https://portulanclarin.net/who/#staff
This operation aims to further develop PORTULAN CLARIN by expanding the infrastructure with new processing services and scientific data that harness advances in artificial intelligence. My responsibilities include coordinating the Work Package (WP) on Lexical Databases and Large Language Models (LLMs), which focuses on integrating lexicographic resources into AI processes.
Main responsibilities and tasks to be carried out:
- Coordination of the integration of lexical databases in LLM training, ensuring the incorporation of rigorous and up-to-date lexicographic data.
- Preparation of lexical databases, with particular attention to the identification, cleaning and normalisation of lexicographic data from the Academy’s dictionary.
- Extraction of semantic information, such as synonymy and hypernymy, using advanced natural language processing (NLP) techniques.
- Use of the Academy’s Dictionary as a fundamental resource, given its lexical richness and precision, to train models that improve semantic understanding algorithms and generative systems.
- Development of data augmentation strategies, including the generation of synthetic data through chatbots.
- Creation of textual databases aligned with the cultural specificities present in lexical databases, promoting semantic contextualisation.
- Fine-tuning of models for specific tasks, such as commonsense reasoning and text simplification.