Welcome! I am a PhD student in Computer Science at Cornell University, where I am very fortunate to be advised by Jon Kleinberg. I am currently doing extended research with the Google Brain Team, where my primary mentors are Quoc Le and Samy Bengio. In the past, I've collaborated frequently with Jascha Sohl-Dickstein and Surya Ganguli at Stanford.
Before Cornell, I was at the University of Cambridge (Trinity College), where I completed my Bachelor's and Master's (Part III of the Tripos) in Mathematics. Prior to that I was very fortunate to spend my final years in high school competing in national and international mathematical Olympiads. The highlight was being part of the UK team at the China Girls Math Olympiad.
My research interests are broadly in machine learning, particularly deep learning. A primary research aim of mine is to understand and improve deep neural networks by combining systematic experiments with mathematically rooted methods. I am also more broadly interested in designing and applying interpretable machine learning models in different contexts.
Abstract. Deep reinforcement learning has achieved many recent successes, but our understanding of its strengths and limitations is hampered by the lack of rich environments in which we can fully characterize optimal behavior, and correspondingly diagnose individual actions against such a characterization. Here we consider a family of combinatorial games, arising from work of Erdos, Selfridge, and Spencer, and we propose their use as environments for evaluating and comparing different approaches to reinforcement learning. These games have a number of appealing features: they are challenging for current learning approaches, but they form (i) a low-dimensional, simply parametrized environment where (ii) there is a linear closed form solution for optimal behavior from any state, and (iii) the difficulty of the game can be tuned by changing environment parameters in an interpretable way. We use these Erdos-Selfridge-Spencer games not only to compare different algorithms, but also to compare approaches based on supervised and reinforcement learning, to analyze the power of multi-agent approaches in improving performance, and to evaluate generalization to environments outside the training set.
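The abstract does not spell out the rules, so the following is only a minimal sketch under an assumed attacker-defender formulation: pieces sit on levels 0 to K, the attacker splits them into two sets, the defender destroys one set, and the surviving pieces each move up one level. Under that assumption, a potential that is linear in the piece counts (each piece at level l weighs 2^-(K - l)) suggests a simple defender rule: destroy the heavier set. The function names here are hypothetical.

```python
def potential(levels, K):
    """Linear potential: each piece at level l contributes 2**-(K - l)."""
    return sum(2.0 ** -(K - l) for l in levels)


def defender_choice(set_a, set_b, K):
    """Return 'A' or 'B', the set the defender destroys (the heavier one)."""
    return "A" if potential(set_a, K) >= potential(set_b, K) else "B"


def defender_step(set_a, set_b, K):
    """Destroy the heavier set; the surviving pieces each advance one level."""
    survivors = set_b if defender_choice(set_a, set_b, K) == "A" else set_a
    return [l + 1 for l in survivors]


# Toy position with K = 5: the attacker proposes the split A = [3, 2], B = [4].
K = 5
set_a, set_b = [3, 2], [4]
before = potential(set_a + set_b, K)
after = potential(defender_step(set_a, set_b, K), K)
print(defender_choice(set_a, set_b, K), before, after)
```

Because the surviving set holds at most half of the total potential and advancing one level doubles each piece's weight, this rule never lets the potential grow, which is what makes the optimal policy easy to characterize.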
Abstract. We propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less. Code
Abstract. We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute. Our approach is based on an interrelated set of measures of expressivity, unified by the novel notion of trajectory length, which measures how the output of a network changes as the input sweeps along a one-dimensional path. Our findings can be summarized as follows: (1) The complexity of the computed function grows exponentially with depth. (2) All weights are not equal: trained networks are more sensitive to their lower (initial) layer weights. (3) Regularizing on trajectory length (trajectory regularization) is a simpler alternative to batch normalization, with the same performance.
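As a rough illustration of trajectory length, the sketch below pushes a discretized circle through a small random tanh network and sums the distances between consecutive points after each layer. The architecture, the weight scale `sigma_w`, the widths, and all function names are assumptions chosen for the example, not the paper's experimental setup.

```python
import math
import random


def path_length(points):
    """Sum of Euclidean distances between consecutive points on the path."""
    total = 0.0
    for p, q in zip(points, points[1:]):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return total


def random_layer(n_in, n_out, sigma_w, rng):
    """Weight matrix with i.i.d. N(0, sigma_w**2 / n_in) entries (assumed init)."""
    scale = sigma_w / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x):
    """One fully connected tanh layer without biases."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def trajectory_lengths(points, widths, sigma_w=2.0, seed=0):
    """Length of the path at the input and after each layer of a random network."""
    rng = random.Random(seed)
    lengths = [path_length(points)]
    n_in = len(points[0])
    for n_out in widths:
        weights = random_layer(n_in, n_out, sigma_w, rng)
        points = [apply_layer(weights, p) for p in points]
        lengths.append(path_length(points))
        n_in = n_out
    return lengths


# A closed unit circle in 2D, sampled at 200 steps, as the one-dimensional input path.
circle = [(math.cos(2 * math.pi * i / 200), math.sin(2 * math.pi * i / 200))
          for i in range(201)]
lengths = trajectory_lengths(circle, widths=[16, 16, 16])
print(lengths)
```

Comparing the entries of `lengths` across layers, and repeating the run for different values of `sigma_w` or deeper `widths`, is one way to see how the path stretches with depth in this toy setting.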
Abstract. We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parametrization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theoretical properties of LAMP, including its steady-state and mixing time. We then give an algorithm based on alternating minimization to learn LAMP models from data. Finally, we perform a series of real-world experiments to show that LAMP is more powerful than first-order Markov processes, and even holds its own against deep sequential models (LSTMs) with a negligible increase in parameter complexity.
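One way to picture the model, assuming the formulation in which a lag is drawn from a distribution `w` over lags and the next state is then drawn from a single first-order transition matrix `P` applied to the state at that lag, is the sketch below. The names, and the choice to clamp lags that reach past the start of the history to the first state, are assumptions made for illustration.

```python
import random


def lamp_next_distribution(history, P, w):
    """Pr[next = y] = sum over lags i of w[i-1] * P[state i steps back][y]."""
    n = len(P)
    dist = [0.0] * n
    for lag, weight in enumerate(w, start=1):
        # Assumption: lags longer than the history fall back to the first state.
        idx = max(len(history) - lag, 0)
        state = history[idx]
        for y in range(n):
            dist[y] += weight * P[state][y]
    return dist


def lamp_sample(start, P, w, steps, seed=0):
    """Sample a trajectory of the given number of steps from the start state."""
    rng = random.Random(seed)
    history = [start]
    for _ in range(steps):
        dist = lamp_next_distribution(history, P, w)
        history.append(rng.choices(range(len(P)), weights=dist)[0])
    return history


# Two states, a row-stochastic matrix P, and equal weight on lags 1 and 2.
P = [[0.9, 0.1],
     [0.2, 0.8]]
w = [0.5, 0.5]
print(lamp_next_distribution([0, 1], P, w))
print(lamp_sample(0, P, w, steps=10))
```

With `w = [1.0]` the sketch reduces to an ordinary first-order Markov chain. In general the parameters are one transition matrix plus one weight per lag, rather than a table indexed by every possible history.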
Abstract. We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in generic, deep neural networks with random weights. Our results reveal an order-to-chaos expressivity phase transition, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth but not width. We prove this generic class of deep random functions cannot be efficiently computed by any shallow network, going beyond prior work restricted to the analysis of single functions. Moreover, we formalize and quantitatively demonstrate the long conjectured idea that deep networks can disentangle highly curved manifolds in input space into flat manifolds in hidden space. Our theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.
Abstract. Team performance is a ubiquitous area of inquiry in the social sciences, and it motivates the problem of team selection -- choosing the members of a team for maximum performance. Influential work of Hong and Page has argued that testing individuals in isolation and then assembling the highest-scoring ones into a team is not an effective method for team selection. For a broad class of performance measures, based on the expected maximum of random variables representing individual candidates, we show that tests directly measuring individual performance are indeed ineffective, but that a more subtle family of tests used in isolation can provide a constant-factor approximation for team performance. These new tests measure the 'potential' of individuals, in a precise sense, rather than performance; to our knowledge they represent the first time that individual tests have been shown to produce near-optimal teams for a non-trivial team performance measure. We also show families of submodular and supermodular team performance functions for which no test applied to individuals can produce near-optimal teams, and discuss implications for submodular maximization via hill-climbing.
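A toy calculation can make the expected-maximum measure concrete. The two candidate types below (`steady` and `risky`) and their score distributions are invented for illustration and do not come from the paper. Each candidate is modeled as an independent discrete random variable, and a team is scored by the expected maximum of its members' draws.

```python
from itertools import product


def mean(dist):
    """Expected individual score of a candidate given as (value, probability) pairs."""
    return sum(v * p for v, p in dist)


def expected_max(team):
    """Exact E[max] over independent candidates, by enumerating joint outcomes."""
    total = 0.0
    for outcome in product(*team):
        prob = 1.0
        for _, p in outcome:
            prob *= p
        total += prob * max(v for v, _ in outcome)
    return total


# Hypothetical candidates: one always scores 1; the other scores 3 with probability 0.3.
steady = [(1.0, 1.0)]
risky = [(0.0, 0.7), (3.0, 0.3)]

print(mean(steady), mean(risky))
print(expected_max([steady, steady]), expected_max([risky, risky]))
```

In this example the steady candidate has the higher individual mean, yet the team of two risky candidates has the higher expected maximum, which is the kind of gap between individual performance and team performance that the abstract describes.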
Abstract. Two recently developed methods, Feedback Alignment (FA) and Direct Feedback Alignment (DFA), have been shown to obtain surprising performance on vision tasks by replacing the traditional backpropagation update with a random feedback update. However, it is still not clear what mechanisms allow learning to happen with these random updates. In this work we argue that DFA can be viewed as a noisy variant of a layer-wise training method we call Linear Aligned Feedback Systems (LAFS). We support this connection theoretically by comparing the update rules for the two methods. We additionally empirically verify that the random update matrices used in DFA work effectively as readout matrices, and that strong correlations exist between the error vectors used in the DFA and LAFS updates. With this new connection between DFA and LAFS we are able to explain why the “alignment” happens in DFA.