Welcome! I am a PhD student in Computer Science at Cornell University, where I am very fortunate to be advised by Jon Kleinberg. I am currently doing extended research with the Google Brain Team, particularly Jascha Sohl-Dickstein and Quoc Le. I also collaborate frequently with Surya Ganguli at Stanford.
Before Cornell, I was at the University of Cambridge (Trinity College) where I completed my Bachelors and Masters (Part III of the Tripos) in Mathematics.
My research interests are broadly in machine learning, particularly deep learning. A primary research aim of mine is to understand and improve deep neural networks by combining systematic experiments with mathematically rooted methods. I am also more broadly interested in designing and applying interpretable machine learning models in different contexts.
Abstract.With the continuing empirical successes of deep networks, it becomes increasingly important to develop better methods for understanding training of models and the representations learned within. In this paper we propose Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less.
Abstract. We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute. Our approach is based on an interrelated set of measures of expressivity, unified by the novel notion of trajectory length, which measures how the output of a network changes as the input sweeps along a one-dimensional path. Our findings can be summarized as follows: (1) The complexity of the computed function grows exponentially with depth. (2) All weights are not equal: trained networks are more sensitive to their lower (initial) layer weights. (3) Regularizing on trajectory length (trajectory regularization) is a simpler alternative to batch normalization, with the same performance.
Abstract. We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parametrization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theoretical properties of LAMP, including its steady-state and mixing time. We then give an algorithm based on alternating minimization to learn LAMP models from data. Finally, we perform a series of real-world experiments to show that LAMP is more powerful than first-order Markov processes, and even holds its own against deep sequential models (LSTMs) with a negligible increase in parameter complexity.
Abstract. We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in generic, deep neural networks with random weights. Our results reveal an order-to-chaos expressivity phase transition, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth but not width. We prove this generic class of deep random functions cannot be efficiently computed by any shallow network, going beyond prior work restricted to the analysis of single functions. Moreover, we formalize and quantitatively demonstrate the long conjectured idea that deep networks can disentangle highly curved manifolds in input space into flat manifolds in hidden space. Our theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.
Abstract. Team performance is a ubiquitous area of inquiry in the social sciences, and it motivates the problem of team selection -- choosing the members of a team for maximum performance. Influential work of Hong and Page has argued that testing individuals in isolation and then assembling the highest-scoring ones into a team is not an effective method for team selection. For a broad class of performance measures, based on the expected maximum of random variables representing individual candidates, we show that tests directly measuring individual performance are indeed ineffective, but that a more subtle family of tests used in isolation can provide a constant-factor approximation for team performance. These new tests measure the 'potential' of individuals, in a precise sense, rather than performance; to our knowledge they represent the first time that individual tests have been shown to produce near-optimal teams for a non-trivial team performance measure. We also show families of subdmodular and supermodular team performance functions for which no test applied to individuals can produce near-optimal teams, and discuss implications for submodular maximization via hill-climbing.
Abstract.Two recently developed methods, Feedback Alignment (FA) and Direct Feedback Alignment (DFA), have been shown to obtain surprising performance on vision tasks by replacing the traditional backpropagation update with a random feedback update. However, it is still not clear what mechanisms allow learning to happen with these random updates. In this work we argue that DFA can be viewed as a noisy variant of a layer-wise training method we call Linear Aligned Feedback Systems (LAFS). We support this connection theoretically by comparing the update rules for the two methods. We additionally empirically verify that the random update matrices used in DFA work effectively as readout matrices, and that strong correlations exist between the error vectors used in the DFA and LAFS updates. With this new connection between DFA and LAFS we are able to explain why the “alignment” happens in DFA.