Felipe Bravo-Márquez

Specialty: Natural language processing, machine learning.
Felipe is an associate professor at the Universidad de Chile. He has worked on the analysis of opinions and emotions in social media, and his work has been published in leading conferences and journals. He has also served on the program committees of major conferences in artificial intelligence and natural language processing. https://felipebravom.com/

PUBLICATIONS

Participatory society has often been regarded positively, frequently associated with the ideals of a more democratic and equitable civilization. Nevertheless, the idea of participation may act as a two-sided phenomenon in terms of empowerment, especially in the realm of social media platforms. This dichotomy is evident as increased participation often leads to a rise in offensive and divisive language, reflecting the challenging balance between open dialogue and the maintenance of respectful discourse on these platforms. In this work, we comprehensively examine the use of offensive language during a highly polarizing event on two online platforms, Twitter and WhatsApp. In our study, we focus on the 2021 Chilean Presidential Elections, a political event where candidates from two opposing parties faced each other. Using a state-of-the-art model and all available labeled data in the literature, we determine the level of offensive language across platforms and parties. Our results show that Twitter messages contain, on average, up to 15% more offensive language than WhatsApp messages.

Numerous datasets have been proposed to evaluate social bias in Natural Language Processing (NLP) systems. However, assessing bias within specific application domains remains challenging, as existing approaches often face limitations in scalability and fidelity across domains. In this work, we introduce a domain-adaptive framework that utilizes prompting with Large Language Models (LLMs) to automatically transform template-based bias datasets into domain-specific variants. We apply our method to two widely used benchmarks—Equity Evaluation Corpus (EEC) and Identity Phrase Templates Test Set (IPTTS)—adapting them to the Twitter and Wikipedia Talk data. Our results show that the adapted datasets yield bias estimates more closely aligned with real-world data. These findings highlight the potential of LLM-based prompting to enhance the realism and contextual relevance of bias evaluation in NLP systems.

Publisher: IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

ABSTRACT

Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.

In this paper, we present a comprehensive comparison between specialized Lexical Semantic Change Detection (LSCD) models and Large Language Models (LLMs) for the LSCD task. In addition to comparing models, we also investigate the role of automatic prompt selection for improving LLM performance. We evaluate three approaches: Average Pairwise Distance (APD), Word-in-Context (WiC), and Word Sense Induction (WSI). Using Spearman correlation as the evaluation metric, we assess the performance of Mixtral, Llama 3.1, Llama 3.3, and specialized LSCD models across English and Spanish datasets. Our results show that by using prompt optimization and LLMs, we achieve state-of-the-art performance for the English dataset and outperform specialized LSCD models at the annotation level in the same dataset. For Spanish, specialized models outperform LLMs across all three approaches—WiC, APD, and WSI—indicating that specialized LSCD models are still more effective for semantic change detection in Spanish.
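
To make the Average Pairwise Distance (APD) approach and the Spearman evaluation more concrete, here is a minimal sketch in plain Python. It assumes that each word use is already represented as an embedding vector and that cosine distance is the distance measure. Both are assumptions for illustration, as are the function names; none of this is taken from the paper. The rank correlation shown assumes there are no tied values.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(uses_t1, uses_t2):
    """Mean distance over all cross-period pairs of word-use vectors."""
    distances = [cosine_distance(u, v) for u, v in product(uses_t1, uses_t2)]
    return sum(distances) / len(distances)


def spearman(xs, ys):
    """Spearman rank correlation (simple formula, assumes no ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical toy vectors: one word whose uses stay the same, one whose uses shift.
stable_score = average_pairwise_distance([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0]])
changed_score = average_pairwise_distance([[1.0, 0.0]], [[0.0, 1.0]])
```

In this setup, a higher APD score for a word suggests more semantic change between the two periods, and the predicted scores for a list of target words would be compared against gold change scores with `spearman`.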

The increasing use of Machine Learning (ML) in sensitive domains such as healthcare, finance, and public policy has raised concerns about the transparency of automated decisions. Explainable AI (XAI) addresses this by clarifying how models generate predictions, yet most methods demand technical expertise, limiting their value for novices. This gap is especially critical in no-code ML platforms, which seek to democratize AI but rarely include explainability. We present a human-centered XAI module in DashAI, an open-source no-code ML platform. The module integrates three complementary techniques, which are Partial Dependence Plots (PDP), Permutation Feature Importance (PFI), and KernelSHAP, into DashAI's workflow for tabular classification. A user study (N = 20; ML novices and experts) evaluated usability and the impact of explanations. Results show: (i) high task success (≥80%) across all explainability tasks; (ii) novices rated explanations as useful, accurate, and trustworthy on the Explanation Satisfaction Scale (ESS, Cronbach's α = 0.74, a measure of internal consistency), while experts were more critical of sufficiency and completeness; and (iii) explanations improved perceived predictability and confidence on the Trust in Automation scale (TiA, α = 0.60), with novices showing higher trust than experts. These findings highlight a central challenge for XAI in no-code ML, making explanations both accessible to novices and sufficiently detailed for experts.
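
As an illustration of one of the three techniques, the sketch below shows the general idea behind Permutation Feature Importance: shuffle one feature column and measure how much the model's score drops. It is a minimal plain-Python sketch and not DashAI's implementation. The toy model, the data, and all function names are hypothetical, and accuracy is assumed as the score.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    correct = sum(1 for x, y in zip(rows, labels) if model(x) == y)
    return correct / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [x[feature] for x in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original one.
        permuted = [x[:feature] + [v] + x[feature + 1:] for x, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


def toy_model(x):
    """Hypothetical classifier that only looks at feature 0."""
    return 1 if x[0] > 0.5 else 0


# Hypothetical data: the label equals feature 0, and feature 1 is irrelevant.
rows = [[i % 2, (i * 7) % 5] for i in range(20)]
labels = [x[0] for x in rows]

importance_used = permutation_importance(toy_model, rows, labels, feature=0)
importance_ignored = permutation_importance(toy_model, rows, labels, feature=1)
```

Shuffling the feature the model relies on lowers accuracy, while shuffling the feature it ignores leaves accuracy unchanged, so the second importance is zero.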

The rapid growth of the language agent field, driven by advances in Large Language Models (LLMs), has led to agent designs that often rely on ad hoc methods. This lack of structure makes it challenging to understand, compare, reuse, and evolve these agents effectively, highlighting the need for a standardized framework to describe their architectures. This paper introduces FALAA (Framework for the Abstraction of Language Agent Architectures), a dual-level specification framework aimed at addressing these challenges by proposing a standardized structure and a description methodology that abstracts and describes LLM-based agent architectures using six essential components: Planner, Executor, Evaluator, Reflector, Memory, and Environment. Using this proposed structure, FALAA leverages UML (Unified Modeling Language) and OCL (Object Constraint Language) to provide a description methodology composed of two levels: (1) a conceptual description level, which visually represents the standardized components and behaviors of language agents through UML class and sequence diagrams, and (2) a formal specification level, which employs OCL to define invariants, conditions, and complex behaviors beyond UML's expressive capacity. By establishing a clear convention for the structure and responsibilities of the essential agent components, along with a standardized description methodology, FALAA aims to eliminate ambiguity and ensure a reusable, unified standard for agent architectures. This framework pursues the goal of improved clarity, consistency, and precision in describing language agents, thereby supporting better comparison, evaluation, and development of LLM-based agents. The proposed approach is exemplified through a practical example and case studies, demonstrating its effectiveness in representing agent behaviors and architectures.

Clinical decision-making in healthcare often relies on unstructured text data, which can be challenging to analyze using traditional methods. Natural Language Processing (NLP) has emerged as a promising solution, but its application in clinical settings is hindered by restricted data availability and the need for domain-specific knowledge. Methods: We conducted an experimental analysis to evaluate the performance of various NLP modeling paradigms on multiple clinical NLP tasks in Spanish. These tasks included referral prioritization and referral specialty classification. We simulated three clinical settings with varying levels of data availability and evaluated the performance of four foundation models. Results: Clinical-specific pre-trained language models (PLMs) achieved the highest performance across tasks. For referral prioritization, Clinical PLMs attained an 88.85% macro F1 score when fine-tuned. In referral specialty classification, the same models achieved a 53.79% macro F1 score, surpassing domain-agnostic models. Continuing pre-training with environment-specific data improved model performance, but the gains were marginal compared to the computational resources required. Few-shot learning with large language models (LLMs) demonstrated lower performance but showed potential in data-scarce scenarios. Conclusions: Our study provides evidence-based recommendations for clinical NLP practitioners on selecting modeling paradigms based on data availability. We highlight the importance of considering data availability, task complexity, and institutional maturity when designing and training clinical NLP models. Our findings can inform the development of effective clinical NLP solutions in real-world settings.

Word embeddings (WEs) often reflect biases present in their training data, and various bias mitigation and evaluation techniques have been proposed to address this. Existing benchmarks for comparing different debiasing methods overlook two factors: the choice of training words and model hyper-parameters. We propose a robust comparison methodology that incorporates them using nested cross-validation, hyper-parameter optimization, and the corrected paired Student's t-test. Our results show that when using our evaluation approach many recent debiasing methods do not offer statistically significant improvements over the original hard debiasing model.
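
The abstract does not spell out the statistical test, so the sketch below assumes the commonly used variance correction for resampled paired t-tests (often attributed to Nadeau and Bengio), in which the variance of the score differences is scaled by 1/k + n_test/n_train instead of 1/k. The function name, the scores, and the split sizes are hypothetical and only illustrate the calculation.

```python
import math
from statistics import mean, variance


def corrected_paired_t(scores_a, scores_b, n_train, n_test):
    """t statistic for paired scores from k resampled train/test splits.

    Assumption: the usual 1/k variance factor is replaced by
    1/k + n_test/n_train to account for overlapping training sets.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    sample_variance = variance(diffs)  # unbiased estimate (divides by k - 1)
    standard_error = math.sqrt((1.0 / k + n_test / n_train) * sample_variance)
    return mean(diffs) / standard_error


def naive_paired_t(scores_a, scores_b):
    """Standard paired t statistic, shown only for comparison."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / k)


# Hypothetical per-fold scores for two debiasing methods.
method_a = [0.80, 0.82, 0.79, 0.81, 0.83]
method_b = [0.78, 0.80, 0.79, 0.78, 0.80]

t_corrected = corrected_paired_t(method_a, method_b, n_train=800, n_test=200)
t_naive = naive_paired_t(method_a, method_b)
```

The corrected statistic is smaller in magnitude than the naive one and would be compared against a Student's t distribution with k - 1 degrees of freedom, which makes it harder for a small difference between methods to appear significant.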

There has been extensive work on human word sense annotation, i.e., manually labeling word uses in natural texts according to their senses. Such labels were primarily created for the tasks of Word Sense Disambiguation (WSD) and Word Sense Induction (WSI). However, almost all datasets annotated with word senses are synchronic datasets, i.e., contain texts created in a relatively short period of time and often do not provide the creation date of the texts. This ignores possible applications in diachronic-historic settings, where the aim is to induce or disambiguate historical word senses or changes in senses across time. To facilitate investigations into historical WSD and WSI and to establish connections with the task of Lexical Semantic Change Detection (LSCD), there is a crucial need for historical word sense-annotated data. Hence, we created a new reliable diachronic WSD/WSI dataset ‘DWUG DE Sense’. We describe the preparation and annotation and analyze central statistics. We then describe a thorough evaluation of different prediction systems for jointly solving both WSI and LSCD tasks. All our systems are based on a state-of-the-art architecture that combines Word-in-Context models and graph clustering techniques with different hyperparameter settings. Our findings reveal that using the WSI task as optimization criterion yields better results for both tasks even when the LSCD task is the focal point of optimization. This underscores that although both tasks are related, WSI seems to be more general and able to incorporate the LSCD task.

Large language models (LLMs) are now a very common and successful approach to language and retrieval tasks. While these LLMs achieve surprisingly good results, it is a challenge to use them on more constrained resources. Techniques to compress these LLMs into smaller and faster models have emerged for English or multilingual settings, but it is still a challenge for other languages. In fact, Spanish is the second language by number of native speakers but lacks this kind of resource. In this work, we evaluate all the models publicly available for Spanish on a set of 6 tasks and then, by leveraging Knowledge Distillation, we present Speedy Gonzales, a collection of inference-efficient task-specific language models based on the ALBERT architecture. All of our models (fine-tuned and distilled) are publicly available on: https://huggingface.co/dccuchile.
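
For readers unfamiliar with Knowledge Distillation, the following plain-Python sketch shows the classic soft-target loss, in which a student model is trained to match a teacher's temperature-softened output distribution. This is the textbook formulation and is not confirmed to be the exact objective used for Speedy Gonzales. The logits, the temperature, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a three-class task.
loss_matching = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_mismatch = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it is usually combined with the ordinary supervised loss on the task labels.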
