Andrés Neyem

Specialty: Software engineering, mobile and cloud computing, machine learning for intelligent systems, medical and engineering education, extended reality
Andrés is a professor in the Department of Computer Science at the Pontificia Universidad Católica de Chile. He received his Ph.D. in computer science from the Universidad de Chile. His research interests include software engineering, mobile and cloud computing, machine learning for intelligent systems, engineering and medical education, and extended reality. He has published a wide range of papers in conference proceedings and journals in these research areas, and he has developed several software products for cloud-based mobile systems.

PUBLICATIONS

Software Capstone Projects provide valuable hands-on experience for students in software development, and creating effective commit messages is an essential, though often challenging, part of this process. These messages play a key role in managing repositories, facilitating collaboration, and offering insights into the project’s progression for mentors and managers. However, creating high-quality commit messages can be challenging, especially for novice developers. We introduce LetsCommit, a tool designed to improve the traditional Git commit command line interface. The tool utilizes three state-of-the-art Large Language Models (LLMs): GPT-3.5, GPT-4, and LLaMa-2, to provide commit message suggestions to students. Results from a user experience survey showed high satisfaction, indicating strong potential for incorporating LetsCommit into future projects. Beyond its technical applications, LetsCommit possesses transformative potential in the field of education. The iterative learning process it supports, coupled with real-time insights, reinforces good software development practices and enhances the overall learning experience. These findings highlight LetsCommit’s substantial impact on software engineering education, setting the stage for further advancements.

In medical education, traditional anatomy labs have relied heavily on the hands-on dissection of cadavers to teach the complex spatial relationships within the human body. However, the advent of virtual reality (VR) technology offers the potential for significantly enhancing this traditional approach by providing immersive, interactive 3D visualizations that can overcome some of the limitations of physical specimens. This study explores the integration of VR into a traditional gross anatomy lab to enrich the learning experience for medical students. Methods included the deployment of a VR application developed to complement the dissection process, featuring detailed 3D models of human anatomy that students could manipulate and explore digitally. Approximately 60 second-year medical students participated in the lab, where they engaged with both traditional dissection and the VR application. Results indicated that the VR integration not only increased engagement and satisfaction but also improved the students' ability to understand anatomical structures and their spatial relationships. Moreover, feedback from students suggested more efficient learning and retention than with traditional methods alone. We conclude that VR technology can significantly enhance medical anatomy education by providing an adjunct to traditional dissection, potentially replacing certain aspects of physical specimens with digital simulations that offer repeatable, detailed exploration without the associated logistical and ethical constraints.

Large language models (LLMs) such as GPT-4o have the potential to transform clinical decision-making, patient education, and medical research. Despite impressive performance in generating patient-friendly educational materials and assisting in clinical documentation, concerns remain regarding the reliability, subtle errors, and biases that can undermine their use in high-stakes medical settings. A multi-phase experimental design was employed to assess the performance of GPT-4o on the Chilean anesthesiology exam (CONACEM), which comprised 183 questions covering four cognitive domains (Understanding, Recall, Application, and Analysis) based on Bloom’s taxonomy. Thirty independent simulation runs were conducted with systematic variation of the model’s temperature parameter to gauge the balance between deterministic and creative responses. The generated responses underwent qualitative error analysis using a refined taxonomy that categorized errors such as “Unsupported Medical Claim,” “Hallucination of Information,” “Sticking with Wrong Diagnosis,” “Non-medical Factual Error,” “Incorrect Understanding of Task,” “Reasonable Response,” “Ignore Missing Information,” and “Incorrect or Vague Conclusion.” Two board-certified anesthesiologists performed independent annotations, with disagreements resolved by a third expert. Statistical evaluations, including one-way ANOVA, non-parametric tests, chi-square, and linear mixed-effects modeling, were used to compare performance across domains and analyze error frequency. GPT-4o achieved an overall accuracy of 83.69%. Performance varied significantly by cognitive domain, with the highest accuracy observed in the Understanding (90.10%) and Recall (84.38%) domains, and lower accuracy in Application (76.83%) and Analysis (76.54%). Among the 120 incorrect responses, unsupported medical claims were the most common error (40.69%), followed by vague or incorrect conclusions (22.07%). Co-occurrence analyses revealed that unsupported claims often appeared alongside imprecise conclusions, highlighting a trend of compounded errors particularly in tasks requiring complex reasoning. Inter-rater reliability for error annotation was robust, with a mean Cohen’s kappa of 0.73. While GPT-4o exhibits strengths in factual recall and comprehension, its limitations in handling higher-order reasoning and diagnostic judgment are evident through frequent unsupported medical claims and vague conclusions. These findings underscore the need for improved domain-specific fine-tuning, enhanced error mitigation strategies, and integrated knowledge verification mechanisms prior to clinical deployment.
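
The inter-rater reliability reported in the abstract above is a Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch of the textbook formula, not the study's own analysis code; the function name `cohens_kappa` and the sample labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if both raters labelled at random
    according to their own label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")

    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical annotations of four incorrect model responses.
    annotator_1 = ["unsupported", "unsupported", "vague", "vague"]
    annotator_2 = ["unsupported", "unsupported", "vague", "unsupported"]
    print(cohens_kappa(annotator_1, annotator_2))
```

In this toy example the annotators agree on three of four items (p_o = 0.75) while chance agreement is 0.5, giving a kappa of 0.5. The abstract reports a mean kappa, which suggests the statistic was averaged over several comparisons; how that averaging was done is not stated there.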

Large Language Models (LLMs) have demonstrated strong performance on English-language medical exams, but their effectiveness in non-English, high-stakes environments is less understood. This study benchmarks nine LLMs against human examinees on the Chilean Anesthesiology Certification Exam (CONACEM), a Spanish-language board examination. A curated set of 63 multiple-choice questions was used, categorized by Bloom’s taxonomy into four cognitive levels. Model responses were assessed using Item Response Theory and Classical Test Theory, complemented by additional error analysis, categorizing errors as reasoning-based, knowledge-based, or comprehension-related. Closed-source models surpassed open-source models, with GPT-o1 achieving the highest accuracy (88.7%). Deepseek-R1 was a strong performer among open-source options. Item difficulty significantly predicted model accuracy, while discrimination did not. Most errors occurred in application and understanding tasks and were linked to flawed reasoning or knowledge misapplication. These results underscore LLMs’ potential for factual recall in Spanish medical exams but also their limitations in complex reasoning. Incorporating cognitive classification and error taxonomy provides deeper insights into model behavior and supports their cautious use as educational aids in clinical settings.

In software engineering pedagogy, a persistent challenge is the comprehensive assessment of student contributions within software repositories. This study investigates the Git-Truck tool, initially designed for professional software engineers, and explores its adaptability and effectiveness within an academic setting. We specifically focus on the tool's potential for educators when assessing Capstone software repositories. Our results emphasize that educators found bubble chart visualization and metrics such as "Top Contributor" and "Number of Commits" helpful in understanding group dynamics and contribution. We also discuss the tool's limitations among the visual techniques and metrics used. As the educational landscape shifts towards increased virtual and remote modalities, tools like Git-Truck are poised to augment the intricacy and depth of software project evaluations. For those considering adopting or adapting such tools in similar contexts, our study reports the challenges and lessons learned from this experience.

Automation of code reviews using AI models has garnered substantial attention in the software engineering community as a strategy to reduce the cost and effort associated with traditional peer review processes. These models are typically trained on extensive datasets of real-world code reviews that address diverse software development concerns, including testing, refactoring, bug fixes, performance optimization, and maintainability improvements. However, a notable limitation of these datasets is the underrepresentation of code vulnerabilities, critical flaws that pose significant security risks, with security-focused reviews comprising a small fraction of the data. This scarcity of vulnerability-specific data restricts the effectiveness of AI models in identifying and commenting on security-critical code. To address this issue, we propose the creation of a synthetic dataset consisting of vulnerability-focused reviews that specifically comment on security flaws. Our approach leverages Large Language Models (LLMs) to generate human-like code review comments for vulnerabilities, using insights derived from code differences and commit messages. To evaluate the usefulness of the generated synthetic dataset, we plan to use it to fine-tune three existing code review models. We anticipate that the synthetic dataset will improve the performance of the original code review models.

Large language models (LLMs) like GPT-4o have shown promise in advancing medical decision-making and education. However, their performance in Spanish-language medical contexts remains underexplored. This study evaluates the effectiveness of single-agent and multi-agent strategies in answering questions from the EUNACOM, a standardized medical licensure exam in Chile, across 21 medical specialties. Methods: GPT-4o was tested on 1,062 multiple-choice questions from publicly available EUNACOM preparation materials. Single-agent strategies included Zero-Shot, Few-Shot, Chain-of-Thought (CoT), Self-Reflection, and MED-PROMPT, while multi-agent strategies involved Voting, Weighted Voting, Borda Count, MEDAGENTS, and MDAGENTS. Each strategy was tested under three temperature settings (0.3, 0.6, 1.2). Performance was assessed by accuracy, and statistical analyses, including Kruskal–Wallis and Mann–Whitney U tests, were performed. Computational resource utilization, such as API calls and execution time, was also analyzed. Results: MDAGENTS achieved the highest accuracy with a mean score of 89.97% (SD = 0.56%), outperforming all other strategies (p < 0.001). MEDAGENTS followed with a mean score of 87.99% (SD = 0.49%), and the CoT with Few-Shot strategy scored 87.67% (SD = 0.12%). Temperature settings did not significantly affect performance (F(2, 54) = 1.45, p = 0.24). Specialty-level analysis showed the highest accuracies in Psychiatry (95.51%), Neurology (95.49%), and Surgery (95.38%), while lower accuracies were observed in Neonatology (77.54%), Otolaryngology (76.64%), and Urology/Nephrology (76.59%). Notably, several exam questions were correctly answered using simpler single-agent strategies without employing complex reasoning or collaboration frameworks. Conclusions and relevance: Multi-agent strategies, particularly MDAGENTS, significantly enhance GPT-4o’s performance on Spanish-language medical exams, leveraging collaboration to improve diagnostic accuracy. However, simpler single-agent strategies are sufficient to address many questions, highlighting that only a fraction of standardized medical exams require sophisticated reasoning or multi-agent interaction. These findings suggest potential for LLMs as efficient and scalable tools in Spanish-speaking healthcare, though computational optimization remains a key area for future research.
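
Three of the multi-agent strategies named in the abstract above (Voting, Weighted Voting, and Borda Count) are standard ways of combining several answers to one multiple-choice question. The sketch below shows the generic textbook versions only; it is not the study's implementation, and the function names, agent weights, and sample answers are hypothetical. How the study obtained weights or rankings from GPT-4o is not described in the abstract.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the option chosen by the most agents.

    Ties go to the option that appeared first in `answers`.
    """
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, weights):
    """Return the option with the largest summed weight.

    `weights[i]` is an assumed confidence or reliability score for the
    agent that produced `answers[i]`.
    """
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


def borda_count(rankings):
    """Return the option with the highest Borda score.

    Each agent ranks the options from best to worst. With n options, an
    agent's top choice earns n - 1 points, the next n - 2, down to 0.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] += n - 1 - position
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical answers from three agents to one exam question.
    print(majority_vote(["A", "B", "B"]))
    print(weighted_vote(["A", "B", "B"], [0.9, 0.3, 0.4]))
    print(borda_count([["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]))
```

The example illustrates why the strategies can disagree: plain voting picks B (two of three agents), while weighted voting picks A because the single agent choosing A carries more weight (0.9 against 0.7). Borda count uses full rankings rather than single answers and picks B here with 5 points against 3 for A.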

Publisher: IEEE Transactions on Learning Technologies

ABSTRACT

Software assistants have significantly impacted software development for both practitioners and students, particularly in capstone projects. The effectiveness of these tools varies based on their knowledge sources; assistants with localized domain-specific knowledge may have limitations, while tools, such as ChatGPT, using broad datasets, might offer recommendations that do not always match the specific objectives of a capstone course. Addressing a gap in current educational technology, this article introduces an AI Knowledge Assistant specifically designed to overcome the limitations of the existing tools by enhancing the quality and relevance of large language models (LLMs). It achieves this through the innovative integration of contextual knowledge from a local “lessons learned” database tailored to the capstone course. We conducted a study with 150 students using the assistant during their capstone course. Integrated into the Kanban project tracking system, the assistant offered recommendations using different strategies: direct searches in the lessons learned database, direct queries to a generative pretrained transformers (GPT) model, query enrichment with lessons learned before submission to GPT and large language model meta AI (LLaMa) models, and query enhancement with Stack Overflow data before GPT processing. Survey results underscored a strong preference among students for direct LLM queries and those enriched with local repository insights, highlighting the assistant's practical value. Furthermore, our linguistic analysis conclusively demonstrated that texts generated by the LLM closely mirrored the linguistic standards and topical relevance of university course requirements. This alignment not only fosters a deeper understanding of course content but also significantly enhances the material's applicability to real-world scenarios.

Agencia Nacional de Investigación y Desarrollo
Edificio de Innovación UC, Piso 2
Vicuña Mackenna 4860
Macul, Chile