Miguel Romero

Miguel Romero

Especialidad: Algoritmos y complejidad, gestión de datos, aprendizaje automático (por ejemplo, explicabilidad formal, redes neuronales gráficas).
Miguel Romero completó un Phd in Computer Science en la Universidad de Chile en 2016. En el proyecto de Expresividad de redes neuronales de grafos, caracterizó el poder de estas redes en términos de lógicas formales y tests de isomorfismos de grafos. Abordó esto en varios contextos, incluyendo clasificación de nodos y predicción de links, extensiones a hipergrafos, y modelos fundacionales que no asumen un vocabulario previo. En el proyecto de Explicabilidad en modelos de machine learning, estudió la complejidad computacional de calcular nociones de explicaciones formales, tanto deterministas como probabilistas, para distintos modelos. Propuso un framework para razonar sobre shapley values bajo incerteza de las distribuciones subyacentes.

PUBLICACIONES

Despite the wide use of k-Nearest Neighbors as classification models, their explainability properties remain poorly understood from a theoretical perspective. While nearest neighbors classifiers offer interpretability from a ''data perspective'', in which the classification of an input vector x is explained by identifying the vectors v1, ..., vk in the training set that determine the classification of x, we argue that such explanations can be impractical in high-dimensional applications, where each vector has hundreds or thousands of features and it is not clear what their relative importance is. Hence, we focus on understanding nearest neighbor classifications through a ''feature perspective'', in which the goal is to identify how the values of the features in x affect its classification. Concretely, we study abductive explanations such as ''minimum sufficient reasons'', which correspond to sets of features in x that are enough to guarantee its classification, and counterfactual explanations based on the minimum distance feature changes one would have to perform in x to change its classification. We present a detailed landscape of positive and negative complexity results for counterfactual and abductive explanations, distinguishing between discrete and continuous feature spaces, and considering the impact of the choice of distance function involved. Finally, we show that despite some negative complexity results, Integer Quadratic Programming and SAT solving allow for computing explanations in practice.

Knowledge Graph Foundation Models (KGFMs) are at the frontier for deep learning on knowledge graphs (KGs), as they can generalize to completely novel knowledge graphs with different relational vocabularies. Despite their empirical success, our theoretical understanding of KGFMs remains very limited. In this paper, we conduct a rigorous study of the expressive power of KGFMs. Specifically, we show that the expressive power of KGFMs directly depends on the motifs that are used to learn the relation representations. We then observe that the most typical motifs used in the existing literature are binary, as the representations are learned based on how pairs of relations interact, which limits the model's expressiveness. As part of our study, we design more expressive KGFMs using richer motifs, which necessitate learning relation representations based on, e.g., how triples of relations interact with each other. Finally, we empirically validate our theoretical findings, showing that the use of richer motifs results in better performance on a wide range of datasets drawn from different domains.

Link prediction with knowledge graphs has been thoroughly studied in graph machine learning, leading to a rich landscape of graph neural network architectures with successful applications. Nonetheless, it remains challenging to transfer the success of these architectures to relational hypergraphs, where the task of link prediction is over k-ary relations, which is substantially harder than link prediction with knowledge graphs. In this paper, we propose a framework for link prediction with relational hypergraphs, unlocking applications of graph neural networks to fully relational structures. Theoretically, we conduct a thorough analysis of the expressive power of the resulting model architectures via corresponding relational Weisfeiler-Leman algorithms and also via logical expressiveness. Empirically, we validate the power of the proposed model architectures on various relational hypergraph benchmarks. The resulting model architectures substantially outperform every baseline for inductive link prediction, and lead to state-of-the-art results for transductive link prediction.

We study the minimization problem for Conjunctive Regular Path Queries (CRPQs) and unions of CRPQs (UCRPQs). This is the problem of checking, given a query and a number k, whether the query is equivalent to one of size at most k. For CRPQs we consider the size to be the number of atoms, and for UCRPQs the maximum number of atoms in a CRPQ therein, motivated by the fact that the number of atoms has a leading influence on the cost of query evaluation. We show that the minimization problem is decidable, both for CRPQs and UCRPQs. We provide a 2ExpSpace upper-bound for CRPQ minimization, based on a brute-force enumeration algorithm, and an ExpSpace lower-bound. For UCRPQs, we show that the problem is ExpSpace-complete, having thus the same complexity as the classical containment problem. The upper bound is obtained by defining and computing a notion of maximal under-approximation. Moreover, we show that for UCRPQs using the so-called "simple regular expressions" consisting of concatenations of expressions of the form a^+ or a_1 + \dotsb + a_k, the minimization problem becomes \Pi^p_2-complete, again matching the complexity of containment.

Formal XAI (explainable AI) is a growing area that focuses on computing explanations with mathematical guarantees for the decisions made by ML models. Inside formal XAI, one of the most studied cases is that of explaining the choices taken by decision trees, as they are traditionally deemed as one of the most interpretable classes of models. Recent work has focused on studying the computation of sufficient reasons, a kind of explanation in which given a decision tree T and an instance x, one explains the decision T (x) by providing a subset y of the features of x such that for any other instance z compatible with y, it holds that T (z) = T (x), intuitively meaning that the features in y are already enough to fully justify the classification of x by T. It has been argued, however, that sufficient reasons constitute a restrictive notion of explanation. For such a reason, the community has started to study their probabilistic counterpart, in which one requires that the probability of T (z) = T (x) must be at least some value δ ∈ (0, 1], where z is a random instance that is compatible with y. Our paper settles the computational complexity of δ-sufficient-reasons over decision trees, showing that both (1) finding δ-sufficient-reasons that are minimal in size, and (2) finding δ-sufficient-reasons that are minimal inclusion-wise, are computationally intractable. By doing this, we answer two open problems originally raised by Izza et al. (2021), and extend the hardness of explanations for Boolean circuits presented by Wäldchen et al. (2021) to the more restricted case of decision trees. Furthermore, we present sharp non-approximability results under a widely believed complexity hypothesis. On the positive side, we identify structural restrictions of decision trees that make the problem tractable.

Graph databases are currently one of the most popular paradigms for storing data. One of the key conceptual differences between graph and relational databases is the focus on navigational queries that ask whether some nodes are connected by paths satisfying certain restrictions. This focus has driven the definition of several different query languages and the subsequent study of their fundamental properties. We define the graph query language of Regular Queries, which is a natural extension of unions of conjunctive 2-way regular path queries (UC2RPQs) and unions of conjunctive nested 2-way regular path queries (UCN2RPQs). Regular queries allow expressing complex regular patterns between nodes. We formalize regular queries as nonrecursive Datalog programs extended with the transitive closure of binary predicates. This language has been previously considered, but its algorithmic properties are not well understood. Our main contribution is to show elementary tight bounds for the containment problem for regular queries. Specifically, we show that this problem is 2Expspace-complete. For all extensions of regular queries known to date, the containment problem turns out to be non-elementary. Together with the fact that evaluating regular queries is not harder than evaluating UCN2RPQs, our results show that regular queries achieve a good balance between expressiveness and complexity, and constitute a well-behaved class that deserves further investigation.

Publisher: ArXiv Link>

ABSTRACT

Attribution scores reflect how important the feature values in an input entity are for the output of a machine learning model. One of the most popular attribution scores is the SHAP score, which is an instantiation of the general Shapley value used in coalition game theory. The definition of this score relies on a probability distribution on the entity population. Since the exact distribution is generally unknown, it needs to be assigned subjectively or be estimated from data, which may lead to misleading feature scores. In this paper, we propose a principled framework for reasoning on SHAP scores under unknown entity population distributions. In our framework, we consider an uncertainty region that contains the potential distributions, and the SHAP score of a feature becomes a function defined over this region. We study the basic problems of finding maxima and minima of this function, which allows us to determine tight ranges for the SHAP scores of all features. In particular, we pinpoint the complexity of these problems, and other related ones, showing them to be NP-complete. Finally, we present experiments on a real-world dataset, showing that our framework may contribute to a more robust feature scoring.

Attribution scores reflect how important the feature values in an input entity are for the output of a machine learning model. One of the most popular attribution scores is the SHAP score, which is an instantiation of the general Shapley value used in coalition game theory. The definition of this score relies on a probability distribution on the entity population. Since the exact distribution is generally unknown, it needs to be assigned subjectively or be estimated from data, which may lead to misleading feature scores. In this paper, we propose a principled framework for reasoning on SHAP scores under unknown entity population distributions. In our framework, we consider an uncertainty region that contains the potential distributions, and the SHAP score of a feature becomes a function defined over this region. We study the basic problems of finding maxima and minima of this function, which allows us to determine tight ranges for the SHAP scores of all features. In particular, we pinpoint the complexity of these problems, and other related ones, showing them to be NP-complete. Finally, we present experiments on a real-world dataset, showing that our framework may contribute to a more robust feature scoring.

The challenge of answering graph queries over incomplete knowledge graphs is gaining significant attention in the machine learning community. Neuro-symbolic models have emerged as a promising approach, combining good performance with high interpretability. These models utilize trained architectures to execute atomic queries and integrate modules that mimic symbolic query operators. However, most neuro-symbolic query processors are constrained to tree-like graph pattern queries. These queries admit a bottom-up execution with constant values or anchors at the leaves and the target variable at the root. While expressive, tree-like queries fail to capture critical properties in knowledge graphs, such as the existence of multiple edges between entities or the presence of triangles. We introduce a framework for answering arbitrary graph pattern queries over incomplete knowledge graphs, encompassing both cyclic queries and tree-like queries with existentially quantified leaves. These classes of queries are vital for practical applications but are beyond the scope of most current neuro-symbolic models. Our approach employs an approximation scheme that facilitates acyclic traversals for cyclic patterns, thereby embedding additional symbolic bias into the query execution process. Our experimental evaluation demonstrates that our framework performs competitively on three datasets, effectively handling cyclic queries through our approximation strategy. Additionally, it maintains the performance of existing neuro-symbolic models on anchored tree-like queries and extends their capabilities to queries with existentially quantified variables.

agencia nacional de investigación y desarrollo
Edificio de Innovación UC, Piso 2
Vicuña Mackenna 4860
Macul, Chile