Cristóbal Rojas

Cristóbal Rojas

Especialidad: Algoritmos, aprendizaje, sistemas dinámicos y aplicaciones a la ciencia de datos.
Cristóbal Rojas tiene un PhD en Computer Science y en Applied Mathematics de Ecole Polytechnique y University of Pisa, respectivamente, ambos obtenidos en 2008. Ha trabajado en el proyecto Fundamentos matemáticos de arquitecturas modernas de IA, analizando arquitecturas como redes neuronales de grafos y Transformers. Busca resultados matemáticos que caractericen la relación entre la elección de elementos de estas arquitecturas y sus habilidades de expresividad, entrenabilidad y generalización, realizando exploraciones empíricas para evaluar su impacto práctico. También participa en el proyecto Poder computacional en sistemas dinámicos y aplicaciones, buscando entender el poder computacional de sistemas naturales modelados como sistemas dinámicos. El objetivo a largo plazo es clasificar clases relevantes de sistemas dinámicos según su capacidad de procesar información y realizar cómputos. Como aplicaciones, utiliza conceptos de la teoría de sistemas dinámicos para analizar señales biológicas de la red neuronal retiniana, buscando marcadores para clasificar la red subyacente según su capacidad de procesar estímulos visuales.

PUBLICACIONES

Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these machines. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both of these phenomena hinder Transformers to learn even simple pattern sequences. Isolation expresses that any learnable sequence must be isolated from another learnable sequence, and hence some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling in that basin will collapse towards the learned sequence. Here, we mathematically prove these phenomena emerge in all Transformers that use compact positional encoding, and design rigorous experiments, demonstrating that the theoretical limitations we shed light on occur on the practical scale.

An important aspect subtending language understanding and production is the ability to independently encode positional and symbolic information of the words within a sentence. In Transformers, positional information is typically encoded using Positional Encodings (PEs). One such popular PE, namely Rotary PE (RoPE), has been widely used due to its empirical success. Recently, it has been argued that part of RoPE's success emerges from its ability to encode robust positional and semantic information using large and small frequencies, respectively. In this work, we perform a deeper dive into the positional versus symbolic dichotomy of attention heads behavior, both at the theoretical and empirical level. We provide general definitions of what it means for a head to behave positionally or symbolically, prove that these are two mutually exclusive behaviors and develop a metric to quantify them. We apply our framework to analyze Transformer-based LLMs using RoPE and find that all heads exhibit a strong correspondence between behavior and frequency use. Finally, we introduce canonical tasks designed to be either purely positional or symbolic, and demonstrate that the Transformer performance causally relates to the ability of attention heads to leverage the appropriate frequencies. In particular, we show that we can control the Transformer performance by controlling which frequencies the attention heads can access. Altogether, our work provides a detailed understanding of RoPE, and how its properties relate to model behavior.

Using tools from computable analysis, we develop a notion of effectiveness for general dynamical systems as those group actions on arbitrary spaces that contain a computable representative in their topological conjugacy class. Most natural systems one can think of are effective in this sense, including some group rotations, affine actions on the torus and finitely presented algebraic actions. We show that for finitely generated and recursively presented groups, every effective dynamical system is the topological factor of a computable action on an effectively closed subset of the Cantor space. We then apply this result to extend the simulation results available in the literature beyond zero-dimensional spaces. In particular, we show that for a large class of groups, many of these natural actions are topological factors of subshifts of finite type.

We study the computational problem of rigorously describing the asymptotic behavior of topological dynamical systems up to a finite but arbitrarily small pre-specified error. More precisely, we consider the limit set of a typical orbit, both as a spatial object (attractor set) and as a statistical distribution (physical measure), and we prove upper bounds on the computational resources of computing descriptions of these objects with arbitrary accuracy. We also study how these bounds are affected by different dynamical constraints and provide several examples showing that our bounds are sharp in general. In particular, we exhibit a computable interval map having a unique transitive attractor with Cantor set structure supporting a unique physical measure such that both the attractor and the measure are non-computable.

We revisit the problem of algorithmically deciding whether a given infinite connected graph has an Eulerian path, namely, a path that uses every edge exactly once. It has been recently observed that this problem is D0 3 -complete for graphs that have a computable description, whereas it is Π02-complete for graphs that have a highly computable description, and that this same bound holds for the class of automatic graphs. A closely related problem consists of determining the number of ends of a graph, namely, the maximum number of distinct infinite connected components the graph can be separated into after removing a finite set of edges. The complexity of this problem for highly computable graphs is known to be Π02-complete as well. The connection between these two problems lies in that only graphs with one or two ends can have Eulerian paths. In this paper we are interested in understanding the complexity of the infinite Eulerian path problem in the setting where the input graphs are known to have the right number of ends. We find that in this setting the problem becomes strictly easier, and that its exact difficulty varies according to whether the graphs have one or two ends, and to whether the Eulerian path we are looking for is one-way or bi-infinite. For example, we find that deciding existence of a bi-infinite Eulerian path for one-ended graphs is only Π0 1-complete if the graphs are highly computable, and that the same problem becomes decidable for automatic graphs. Our results are based on a detailed computability analysis of what we call the Separation Problem, whichwe believe to be of independent interest. For instance, as a side application, we observe that K¨onig’s infinity lemma, well known to be non-effective in general, becomes effective if we restrict to graphs with finitely many ends.

We propose the first method to show theoretical limitations for one-layer softmax transformers with arbitrarily many precision bits (even infinite). We establish those limitations for three tasks that require advanced reasoning. The first task, Match 3 (Sanford et al., 2023), requires looking at all possible token triplets in an input sequence. The second and third tasks address compositionality-based reasoning: function composition (Peng et al., 2024) and binary relations composition, respectively. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. To overcome these limitations, we introduce Strassen attention and prove that, equipped with this mechanism, a one-layer transformer can in principle solve all these tasks. Importantly, we show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard (Vaswani et al, 2017), higher-order attention (Sanford et al., 2023), and triangular attention (Bergen et al. 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.

We propose the first method to show theoretical limitations for one-layer softmax transformers with arbitrarily many precision bits (even infinite). We establish those limitations for three tasks that require advanced reasoning. The first task, Match 3 (Sanford et al., 2023), requires looking at all possible token triplets in an input sequence. The second and third tasks address compositionality-based reasoning: function composition (Peng et al., 2024) and binary relations composition, respectively. We formally prove the inability of one-layer softmax Transformers to solve any of these tasks. To overcome these limitations, we introduce Strassen attention and prove that, equipped with this mechanism, a one-layer transformer can in principle solve all these tasks. Importantly, we show that it enjoys sub-cubic running-time complexity, making it more scalable than similar previously proposed mechanisms, such as higher-order attention (Sanford et al., 2023). To complement our theoretical findings, we experimentally studied Strassen attention and compared it against standard (Vaswani et al, 2017), higher-order attention (Sanford et al., 2023), and triangular attention (Bergen et al. 2021). Our results help to disentangle all these attention mechanisms, highlighting their strengths and limitations. In particular, Strassen attention outperforms standard attention significantly on all the tasks. Altogether, understanding the theoretical limitations can guide research towards scalable attention mechanisms that improve the reasoning abilities of Transformers.

In 1946, S. Ulam invented Monte Carlo method, which has since become the standard numerical technique for making statistical predictions for long-term behaviour of dynamical systems. We show that this, or in fact any other numerical approach can fail for the simplest non-linear discrete dynamical systems given by the logistic maps f_{a}(x)=ax(1-x) of the unit interval. We show that there exist computable real parameters a\in (0,4) for which almost every orbit of f_a has the same asymptotical statistical distribution in [0,1], but this limiting distribution is not Turing computable.

agencia nacional de investigación y desarrollo
Edificio de Innovación UC, Piso 2
Vicuña Mackenna 4860
Macul, Chile