Alexander Kozachinskiy

Alexander Kozachinskiy

Especialidad: Algorithms, information theory, algorithmic game theory, foundations of machine learning, complexity theory.
Alexander Kozachinskiy posee un PhD en Physics and Mathematics. Obtuvo este grado de la Lomonosov Moscow State University en 2019. Ha trabajado en el proyecto llamado "Expressive power of deep learning architectures", y es el investigador principal del Fondecyt iniciación N° 11250060.

PUBLICACIONES

Kolmogorov (1965) defined the complexity of a string as the minimal length of a program generating . Obviously this definition depends on the choice of the programming language. Kolmogorov noted that there exist \emph{optimal} programming languages that make the complexity function minimal up to additive terms, and we should take one of them -- but which one? Is there a chance to agree on some specific programming language in this definition? Or at least should we add some other requirements to optimality? What can we achieve in this way? In this paper we discuss different suggestions of this type that appeared since 1965, specifically a stronger requirement of universality (and show that in many cases this does not change the set of complexity functions).

Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these machines. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both of these phenomena hinder Transformers to learn even simple pattern sequences. Isolation expresses that any learnable sequence must be isolated from another learnable sequence, and hence some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling in that basin will collapse towards the learned sequence. Here, we mathematically prove these phenomena emerge in all Transformers that use compact positional encoding, and design rigorous experiments, demonstrating that the theoretical limitations we shed light on occur on the practical scale.

An important aspect subtending language understanding and production is the ability to independently encode positional and symbolic information of the words within a sentence. In Transformers, positional information is typically encoded using Positional Encodings (PEs). One such popular PE, namely Rotary PE (RoPE), has been widely used due to its empirical success. Recently, it has been argued that part of RoPE's success emerges from its ability to encode robust positional and semantic information using large and small frequencies, respectively. In this work, we perform a deeper dive into the positional versus symbolic dichotomy of attention heads behavior, both at the theoretical and empirical level. We provide general definitions of what it means for a head to behave positionally or symbolically, prove that these are two mutually exclusive behaviors and develop a metric to quantify them. We apply our framework to analyze Transformer-based LLMs using RoPE and find that all heads exhibit a strong correspondence between behavior and frequency use. Finally, we introduce canonical tasks designed to be either purely positional or symbolic, and demonstrate that the Transformer performance causally relates to the ability of attention heads to leverage the appropriate frequencies. In particular, we show that we can control the Transformer performance by controlling which frequencies the attention heads can access. Altogether, our work provides a detailed understanding of RoPE, and how its properties relate to model behavior.

"Delle Rose et al. (COLT’23) introduced an effective version of the Vapnik-Chervonenkis dimension, and showed that it characterizes improper PAC learning with total computable learners. In this paper, we introduce and study a similar effectivization of the notion of Littlestone dimension. Finite effective Littlestone dimension is a necessary condition for computable online learning but is not a sufficient one—which we already establish for classes of the effective Littlestone dimension 2. However, the effective Littlestone dimension equals the optimal mistake bound for computable learners in two special cases: a) for classes of Littlestone dimension 1 and b) when the learner receives as additional information a bound on the numbers to be guessed. Interestingly, finite effective Littlestone dimension also guarantees that the class consists only of computable functions. Keywords: Online learning, Littlestone dimension, computable machine learning"

The notion of rank of a Boolean function has been a cornerstone in PAC learning, enabling quasipolynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function corresponds to the minimum number of Chain of Thought (CoT) steps required by a single-layer Transformer with hard attention to compute . Based on this characterization we establish tight bounds on the number of CoT steps required for specific problems, showing that-fold function composition necessitates exactly CoT steps. Furthermore, we analyze the problem of identifying the position of the-th occurrence of 1 in a Boolean sequence, proving that it requires CoT steps.

The notion of \emph{rank} of a Boolean function has been a cornerstone in PAC learning theory, enabling quasipolynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function f corresponds to the minimum number of \emph{Chain of Thought} (CoT) steps required by a single-layer Transformer with hard attention to compute f. Based on this characterization we establish tight bounds on the number of CoT steps required for specific problems, showing that \(\ell\)-fold function composition necessitates exactly \(\ell\) CoT steps. Furthermore, we analyze the problem of identifying the position of the \(k\)-th occurrence of 1 in a Boolean sequence, proving that it requires \(k\) CoT steps. Finally, we introduce the notion of the multi-head rank that captures multi-head single-layer transformers, and perform the analysis of PAC-learnability of the classes of functions with bounded multi-head rank.

Despite the wide use of k-Nearest Neighbors as classification models, their explainability properties remain poorly understood from a theoretical perspective. While nearest neighbors classifiers offer interpretability from a ''data perspective'', in which the classification of an input vector x is explained by identifying the vectors v1, ..., vk in the training set that determine the classification of x, we argue that such explanations can be impractical in high-dimensional applications, where each vector has hundreds or thousands of features and it is not clear what their relative importance is. Hence, we focus on understanding nearest neighbor classifications through a ''feature perspective'', in which the goal is to identify how the values of the features in x affect its classification. Concretely, we study abductive explanations such as ''minimum sufficient reasons'', which correspond to sets of features in x that are enough to guarantee its classification, and counterfactual explanations based on the minimum distance feature changes one would have to perform in x to change its classification. We present a detailed landscape of positive and negative complexity results for counterfactual and abductive explanations, distinguishing between discrete and continuous feature spaces, and considering the impact of the choice of distance function involved. Finally, we show that despite some negative complexity results, Integer Quadratic Programming and SAT solving allow for computing explanations in practice.

Publisher:  LATIN 2024: Theoretical Informatics Link>

ABSTRACT

In this paper, we construct a winning condition W over a finite set of colors such that, first, every finite arena has a strategy with 2 states of general memory which is optimal w.r.t. W, and second, there exists no k such that every finite arena has a strategy with k states of chromatic memory which is optimal w.r.t. W.

In this note, we use the VC dimension technique to prove the first lower bound against one-layer softmax transformers with infinite precision. We do so for two tasks: function composition, considered by Peng, Narayanan, and Papadimitriou, and the SUM_2 task, considered by Sanford, Hsu, and Telgarsky.

Formal XAI (explainable AI) is a growing area that focuses on computing explanations with mathematical guarantees for the decisions made by ML models. Inside formal XAI, one of the most studied cases is that of explaining the choices taken by decision trees, as they are traditionally deemed as one of the most interpretable classes of models. Recent work has focused on studying the computation of sufficient reasons, a kind of explanation in which given a decision tree T and an instance x, one explains the decision T (x) by providing a subset y of the features of x such that for any other instance z compatible with y, it holds that T (z) = T (x), intuitively meaning that the features in y are already enough to fully justify the classification of x by T. It has been argued, however, that sufficient reasons constitute a restrictive notion of explanation. For such a reason, the community has started to study their probabilistic counterpart, in which one requires that the probability of T (z) = T (x) must be at least some value δ ∈ (0, 1], where z is a random instance that is compatible with y. Our paper settles the computational complexity of δ-sufficient-reasons over decision trees, showing that both (1) finding δ-sufficient-reasons that are minimal in size, and (2) finding δ-sufficient-reasons that are minimal inclusion-wise, are computationally intractable. By doing this, we answer two open problems originally raised by Izza et al. (2021), and extend the hardness of explanations for Boolean circuits presented by Wäldchen et al. (2021) to the more restricted case of decision trees. Furthermore, we present sharp non-approximability results under a widely believed complexity hypothesis. On the positive side, we identify structural restrictions of decision trees that make the problem tractable.

agencia nacional de investigación y desarrollo
Edificio de Innovación UC, Piso 2
Vicuña Mackenna 4860
Macul, Chile