Profiles - CENIA

Andrés Carvallo

Specialty: Natural language processing, recommender systems, data mining, information retrieval.

Email: andres.carvallo@cenia.cl

Andrés Carvallo completed a Ph.D. in Computer Science at the Pontificia Universidad Católica de Chile in 2022. He is the principal investigator of Fondecyt project 3240001, titled “Towards Unbiased Machine Learning and Natural Language Processing Algorithms: A Multilingual Approach to Fairness”. The project seeks to develop machine learning and natural language processing (NLP) algorithms that are fairer and more transparent, with the goal of mitigating racial, gender, and religious biases and other types of discrimination in language models. It also addresses hate speech detection with explainable models, combining text classification with named entity recognition (NER) to identify targeted groups and offensive intent, promoting more ethical and responsible artificial intelligence in multilingual settings.

PUBLICATIONS

Publisher: Elsevier, Data in Brief

ABSTRACT

The COVID-19 pandemic has underlined the need for reliable information for clinical decision-making and public health policies. As such, evidence-based medicine (EBM) is essential in identifying and evaluating scientific documents pertinent to novel diseases, and the accurate classification of biomedical text is integral to this process. Given this context, we introduce a comprehensive, curated dataset composed of COVID-19-related documents.

This dataset includes 20,047 labeled documents that were meticulously classified into five distinct categories: systematic reviews (SR), primary study randomized controlled trials (PS-RCT), primary study non-randomized controlled trials (PS-NRCT), broad synthesis (BS), and excluded (EXC). The documents, labeled by collaborators from the Epistemonikos Foundation, incorporate information such as document type, title, abstract, and metadata, including PubMed id, authors, journal, and publication date.

Uniquely, this dataset has been curated by the Epistemonikos Foundation and is not readily accessible through conventional web-scraping methods, thereby attesting to its distinctive value in this field of research. In addition to this, the dataset also includes a vast evidence repository comprising 427,870 non-COVID-19 documents, also categorized into SR, PS-RCT, PS-NRCT, BS, and EXC. This additional collection can serve as a valuable benchmark for subsequent research. The comprehensive nature of this open-access dataset and its accompanying resources is poised to significantly advance evidence-based medicine and facilitate further research in the domain.

RL1 2023

Publisher: CEUR-WS

ABSTRACT

The extraction and classification of important information from Spanish Electronic Clinical Narratives (ECNs) can be challenging due to the complexity of the clinical text and the limited availability of labeled data. In this paper, we introduce a chunked Named Entity Recognition model designed to parse and classify sections of ECNs into predefined categories. The model aims to improve section identification and classification accuracy within ECNs in the context of the IberLEF ClinAIS Task. Our system achieves promising performance, obtaining a weighted B2 score of 0.6958, demonstrating its capability to accurately distinguish boundaries between sections. The paper concludes with a comprehensive analysis of the results, discussing potential implications and suggesting directions for further improvements in clinical text analysis.

RL1 2023

Publisher: Elsevier, SoftwareX

ABSTRACT

CoTranslate is a web-based platform designed to efficiently label and review translations from language experts, with the aim of creating high-quality sentence-pair corpora for training neural machine translation models. Built with a Django backend and a ReactJS frontend, the platform fosters collaboration among experts in translating and validating sentences. Focused on developing quality corpora, particularly for low-resource languages, CoTranslate addresses linguistic barriers and enhances translation quality. By streamlining the creation of robust training datasets, CoTranslate holds significant potential to impact the field of machine translation.

RL1 2023

Publisher: arXiv

ABSTRACT

The success of neural network embeddings has entailed a renewed interest in using knowledge graphs for a wide variety of machine learning and information retrieval tasks. In particular, current recommendation methods based on graph embeddings have shown state-of-the-art performance. These methods commonly encode latent rating patterns and content features. Different from previous work, in this paper, we propose to exploit embeddings extracted from graphs that combine information from ratings and aspect-based opinions expressed in textual reviews. We then adapt and evaluate state-of-the-art graph embedding techniques over graphs generated from Amazon and Yelp reviews on six domains, outperforming baseline recommenders. Our approach has the advantage of providing explanations which leverage aspect-based opinions given by users about recommended items. Furthermore, we also provide examples of the applicability of recommendations utilizing aspect opinions as explanations in a visualization dashboard, which allows obtaining information about the most and least liked aspects of similar users obtained from the embeddings of an input graph.

RL1 2022

Publisher: arXiv

ABSTRACT

The success of pretrained word embeddings has motivated their use in the biomedical domain, with contextualized embeddings yielding remarkable results in several biomedical NLP tasks. However, there is a lack of research on quantifying their behavior under severe "stress" scenarios. In this work, we systematically evaluate three language models with adversarial examples -- automatically constructed tests that allow us to examine how robust the models are. We propose two types of stress scenarios focused on the biomedical named entity recognition (NER) task, one inspired by spelling errors and another based on the use of synonyms for medical terms. Our experiments with three benchmarks show that the performance of the original models decreases considerably, in addition to revealing their weaknesses and strengths. Finally, we show that adversarial training causes the models to improve their robustness and even to exceed the original performance in some cases.

RL1 2022

info@cenia.cl

Edificio de Innovación UC, Piso 2
Vicuña Mackenna 4860
Macul, Chile

PUBLICATIONS

A comparative dataset: Bridging COVID-19 and other diseases through epistemonikos and CORD-19 evidence

Automatic Section Classification in Spanish Clinical Narratives Using Chunked Named Entity Recognition

CoTranslate: A web-based tool for crowdsourcing high-quality sentence pair corpora

Graphing else matters: exploiting aspect opinions and ratings in explainable graph-based recommendat

Stress Test Evaluation of Biomedical Word Embeddings