Deep Learning, Neuroscience, and Psychology
Theory of Learning Lab
Gatsby Computational Neuroscience Unit & Sainsbury Wellcome Centre
University College London
CIFAR Azrieli Global Scholar, CIFAR Program on Learning in Machines & Brains
Research: The theory of deep learning and its applications to phenomena in neuroscience and psychology.
Mannelli, S. S., Ivashynka, Y., Saxe, A., & Saglietti, L. (2024). Tilting the Odds at the Lottery: the Interplay of Overparameterisation and Curricula in Neural Networks. arXiv. https://doi.org/10.48550/arXiv.2406.01589
Abstract
A wide range of empirical and theoretical works have shown that overparameterisation can amplify the performance of neural networks. According to the lottery ticket hypothesis, overparameterised networks have an increased chance of containing a sub-network that is well-initialised to solve the task at hand. A more parsimonious approach, inspired by animal learning, consists in guiding the learner towards solving the task by curating the order of the examples, i.e. providing a curriculum. However, this learning strategy seems to be hardly beneficial in deep learning applications. In this work, we undertake an analytical study that connects curriculum learning and overparameterisation. In particular, we investigate their interplay in the online learning setting for a 2-layer network in the XOR-like Gaussian Mixture problem. Our results show that a high degree of overparameterisation – while simplifying the problem – can limit the benefit from curricula, providing a theoretical account of the ineffectiveness of curricula in deep learning.
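To make the setting concrete, here is a minimal sketch of the XOR-like Gaussian mixture problem with an easy-first curriculum for a small two-layer network trained online. The hidden width, learning rate, noise schedule, and squared loss are illustrative assumptions, not the paper's exact protocol:

# Minimal sketch: online SGD on an XOR-like Gaussian mixture with an easy-first curriculum.
# Width, learning rate, noise schedule, and loss are illustrative, not the paper's protocol.
import numpy as np

rng = np.random.default_rng(0)
d, K, lr, steps = 100, 8, 0.05, 20000            # input dim, hidden width, step size, online samples

mu1, mu2 = np.eye(d)[0], np.eye(d)[1]            # two orthogonal cluster directions
def sample(noise_std):
    """One XOR-like sample: clusters along +/-mu1 have label +1, along +/-mu2 label -1."""
    if rng.random() < 0.5:
        x, y = rng.choice([-1, 1]) * mu1, +1.0
    else:
        x, y = rng.choice([-1, 1]) * mu2, -1.0
    return x + noise_std * rng.standard_normal(d) / np.sqrt(d), y

W = rng.standard_normal((K, d)) / np.sqrt(d)     # first-layer weights (overparameterise by raising K)
a = rng.standard_normal(K) / np.sqrt(K)          # second-layer weights

for t in range(steps):
    noise = 0.5 if t < steps // 2 else 2.0       # curriculum: low-variance (easy) examples first
    x, y = sample(noise)
    h = np.maximum(W @ x, 0.0)                   # ReLU hidden layer
    err = h @ a - y
    a -= lr * err * h                            # online SGD on the squared error
    W -= lr * err * np.outer(a * (h > 0), x)

errs = []
for _ in range(200):                             # evaluate on easy (low-variance) examples
    x, y = sample(0.5)
    errs.append(abs(np.maximum(W @ x, 0.0) @ a - y))
print("mean |error| on easy examples:", np.mean(errs))

Varying K and the noise schedule in this sketch is the kind of interplay between overparameterisation and curricula that the paper analyses exactly.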
Singh, A. K., Moskovitz, T., Hill, F., Chan, S. C. Y., & Saxe, A. M. (2024). What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation. arXiv. http://arxiv.org/abs/2404.07129
Abstract
In-context learning is a powerful emergent ability in transformer models. Prior work in mechanistic interpretability has identified a circuit element that may be critical for in-context learning – the induction head (IH), which performs a match-and-copy operation. During training of large transformers on natural language data, IHs emerge around the same time as a notable phase change in the loss. Despite the robust evidence for IHs and this interesting coincidence with the phase change, relatively little is known about the diversity and emergence dynamics of IHs. Why is there more than one IH, and how are they dependent on each other? Why do IHs appear all of a sudden, and what are the subcircuits that enable them to emerge? We answer these questions by studying IH emergence dynamics in a controlled setting by training on synthetic data. In doing so, we develop and share a novel optogenetics-inspired causal framework for modifying activations throughout training. Using this framework, we delineate the diverse and additive nature of IHs. By clamping subsets of activations throughout training, we then identify three underlying subcircuits that interact to drive IH formation, yielding the phase change. Furthermore, these subcircuits shed light on data-dependent properties of formation, such as phase change timing, already showing the promise of this more in-depth understanding of subcircuits that need to "go right" for an induction head.
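For readers unfamiliar with the induction-head operation, here is a tiny non-neural reference implementation of the "match-and-copy" computation described above; it is purely illustrative and says nothing about the transformer circuits or training setup studied in the paper:

# Non-neural reference implementation of "match-and-copy": to predict the next token,
# find the most recent earlier occurrence of the current token and copy what followed it.
def induction_head_predict(tokens):
    """Return a prediction for the token following tokens[-1], or None if no match exists."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):     # "match": scan backwards over the prefix
        if tokens[i] == current:
            return tokens[i + 1]                 # "copy": emit the token that followed the match
    return None

print(induction_head_predict(list("abcab")))     # -> 'c', copying the token that followed 'a' earlier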
Lee, J. H., Mannelli, S. S., & Saxe, A. (2024). Why Do Animals Need Shaping? A Theory of Task Composition and Curriculum Learning. arXiv. http://arxiv.org/abs/2402.18361
Abstract
Diverse studies in systems neuroscience begin with extended periods of curriculum training known as ‘shaping’ procedures. These involve progressively studying component parts of more complex tasks, and can make the difference between learning a task quickly, slowly or not at all. Despite the importance of shaping to the acquisition of complex tasks, there is as yet no theory that can help guide the design of shaping procedures, or more fundamentally, provide insight into its key role in learning. Modern deep reinforcement learning systems might implicitly learn compositional primitives within their multilayer policy networks. Inspired by these models, we propose and analyse a model of deep policy gradient learning of simple compositional reinforcement learning tasks. Using the tools of statistical physics, we solve for exact learning dynamics and characterise different learning strategies including primitives pre-training, in which task primitives are studied individually before learning compositional tasks. We find a complex interplay between task complexity and the efficacy of shaping strategies. Overall, our theory provides an analytical understanding of the benefits of shaping in a class of compositional tasks and a quantitative account of how training protocols can disclose useful task primitives, ultimately yielding faster and more robust learning.
Rossem, L. van, & Saxe, A. M. (2024). When Representations Align: Universality in Representation Learning Dynamics. arXiv. https://doi.org/10.48550/arXiv.2402.09142
Abstract
Deep neural networks come in many sizes and architectures. The choice of architecture, in conjunction with the dataset and learning algorithm, is commonly understood to affect the learned neural representations. Yet, recent results have shown that different architectures learn representations with striking qualitative similarities. Here we derive an effective theory of representation learning under the assumption that the encoding map from input to hidden representation and the decoding map from representation to output are arbitrary smooth functions. This theory schematizes representation learning dynamics in the regime of complex, large architectures, where hidden representations are not strongly constrained by the parametrization. We show through experiments that the effective theory describes aspects of representation learning dynamics across a range of deep networks with different activation functions and architectures, and exhibits phenomena similar to the "rich" and "lazy" regime. While many network behaviors depend quantitatively on architecture, our findings point to certain behaviors that are widely conserved once models are sufficiently flexible.
Zhang, Y., Latham, P. E., & Saxe, A. (2024). Understanding Unimodal Bias in Multimodal Deep Linear Networks. arXiv. http://arxiv.org/abs/2312.00935
Abstract
Using multiple input streams simultaneously to train multimodal neural networks is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training. We develop a theory of unimodal bias with multimodal deep linear networks to understand how architecture and data statistics influence this bias. This is the first work to calculate the duration of the unimodal phase in learning as a function of the depth at which modalities are fused within the network, dataset statistics, and initialization. We show that the deeper the layer at which fusion occurs, the longer the unimodal phase. A long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparametrized regime. Our results, derived for multimodal linear networks, extend to nonlinear networks in certain settings. Taken together, this work illuminates pathologies of multimodal learning under joint training, showing that late and intermediate fusion architectures can give rise to long unimodal phases and permanent unimodal bias. Our code is available at: https://yedizhang.github.io/unimodal-bias.html.
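A minimal numerical illustration of the unimodal phase, using scalar modalities and two-layer linear branches fused late; the data statistics, depths, and learning rate are illustrative assumptions rather than the paper's setting:

# Unimodal bias in a late-fusion deep linear network (numpy, scalar modalities).
# Statistics, depths, and learning rate are illustrative; the point is that the weaker
# modality's branch lingers near zero (the "unimodal phase") before it is learned.
import numpy as np

rng = np.random.default_rng(0)
n, lr, steps = 2000, 0.05, 4000
xa, xb = rng.standard_normal(n), rng.standard_normal(n)
y = 1.0 * xa + 0.3 * xb                           # modality A is far more predictive than B

wa1, wa2, wb1, wb2 = 0.01, 0.01, 0.01, 0.01        # each modality has its own 2-layer linear branch
for t in range(steps):
    pred = wa2 * wa1 * xa + wb2 * wb1 * xb        # late fusion: branches summed at the output
    err = pred - y
    ga1 = np.mean(err * wa2 * xa); ga2 = np.mean(err * wa1 * xa)
    gb1 = np.mean(err * wb2 * xb); gb2 = np.mean(err * wb1 * xb)
    wa1 -= lr * ga1; wa2 -= lr * ga2; wb1 -= lr * gb1; wb2 -= lr * gb2
    if t % 1000 == 0:
        print(f"step {t:4d}  effective A weight {wa2*wa1:+.3f}  effective B weight {wb2*wb1:+.3f}")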
Carrasco-Davis, R., Masís, J., & Saxe, A. M. (2024). Meta-Learning Strategies through Value Maximization in Neural Networks. arXiv. http://arxiv.org/abs/2310.19919
Abstract
Biological and artificial learning agents face numerous choices about how to learn, ranging from hyperparameter selection to aspects of task distributions like curricula. Understanding how to make these meta-learning choices could offer normative accounts of cognitive control functions in biological learners and improve engineered systems. Yet optimal strategies remain challenging to compute in modern deep networks due to the complexity of optimizing through the entire learning process. Here we theoretically investigate optimal strategies in a tractable setting. We present a learning effort framework capable of efficiently optimizing control signals on a fully normative objective: discounted cumulative performance throughout learning. We obtain computational tractability by using average dynamical equations for gradient descent, available for simple neural network architectures. Our framework accommodates a range of meta-learning and automatic curriculum learning methods in a unified normative setting. We apply this framework to investigate the effect of approximations in common meta-learning algorithms; infer aspects of optimal curricula; and compute optimal neuronal resource allocation in a continual learning setting. Across settings, we find that control effort is most beneficial when applied to easier aspects of a task early in learning; followed by sustained effort on harder aspects. Overall, the learning effort framework provides a tractable theoretical test bed to study normative benefits of interventions in a variety of learning systems, as well as a formal account of optimal cognitive control strategies over learning trajectories posited by established theories in cognitive neuroscience.
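The fully normative objective in this framework is easiest to see as a worked equation. A schematic form, in which the notation and the precise cost-of-control term are illustrative assumptions rather than the paper's exact definitions, is

V(g) \;=\; \int_{0}^{T} dt\; e^{-t/\tau}\, \Big[ \mathcal{P}\big(\theta(t)\big) \;-\; C\big(g(t)\big) \Big],
\qquad
\dot{\theta}(t) \;=\; -\,\eta\, g(t)\, \nabla_{\theta}\, \mathcal{L}\big(\theta(t)\big),

where \mathcal{P} is task performance, C the cost of exerting the control signal g(t), \tau the discount timescale, and \theta(t) follows the average gradient-descent dynamics modulated by the control; the meta-learning problem is to choose g(t) to maximize V.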
Patel, N., Lee, S., Mannelli, S. S., Goldt, S., & Saxe, A. (2023). The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions. arXiv. http://arxiv.org/abs/2306.10404
Abstract
Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty - analogous to annealing schemes and curricula during training in RL - and show that the model exhibits rich behaviour, including delayed learning under sparse rewards; a variety of learning regimes depending on reward baselines; and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.
Jarvis, D., Klein, R., Rosman, B., & Saxe, A. M. (2024). On The Specialization of Neural Modules. arXiv. http://arxiv.org/abs/2409.14981
Abstract
A number of machine learning models have been proposed with the goal of achieving systematic generalization: the ability to reason about new situations by combining aspects of previous experiences. These models leverage compositional architectures which aim to learn specialized modules dedicated to structures in a task that can be composed to solve novel problems with similar structures. While the compositionality of these architectures is guaranteed by design, the specialization of the modules is not. Here we theoretically study the ability of network modules to specialize to useful structures in a dataset and achieve systematic generalization. To this end we introduce a minimal space of datasets motivated by practical systematic generalization benchmarks. From this space of datasets we present a mathematical definition of systematicity and study the learning dynamics of linear neural modules when solving components of the task. Our results shed light on the difficulty of module specialization, what is required for modules to successfully specialize, and the necessity of modular architectures to achieve systematicity. Finally, we confirm that the theoretical results in our tractable setting generalize to more complex datasets and non-linear architectures.
Flesch, T., Mante, V., Newsome, W., Saxe, A., Summerfield, C., & Sussillo, D. (2023). Are task representations gated in macaque prefrontal cortex? arXiv. http://arxiv.org/abs/2306.16733
Abstract
A recent paper (Flesch et al., 2022) describes behavioural and neural data suggesting that task representations are gated in the prefrontal cortex in both humans and macaques. This short note proposes an alternative explanation for the reported results from the macaque data.
Saglietti, L., Mannelli, S., & Saxe, A. (2022). An analytical theory of curriculum learning in teacher-student networks. Advances in Neural Information Processing Systems, 35, 21113–21127. https://proceedings.neurips.cc/paper_files/paper/2022/hash/84bad835faaf48f24d990072bb5b80ee-Abstract-Conference.html
Saxe, A., Sodhani, S., & Lewallen, S. J. (2022). The neural race reduction: Dynamics of abstraction in gated networks. International Conference on Machine Learning, 19287–19309. https://proceedings.mlr.press/v162/saxe22a.html
Lee, S., Mannelli, S. S., Clopath, C., Goldt, S., & Saxe, A. (2022). Maslow’s Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation. arXiv. http://arxiv.org/abs/2205.09029
Abstract
Continual learning - learning new tasks in sequence while maintaining performance on old tasks - remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow’s hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
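A minimal sketch of the teacher-student continual-learning setup with controlled task similarity; the widths, learning rate, and similarity parametrisation are illustrative assumptions, not the paper's exact analysis:

# Two soft-committee teachers share hidden weights up to a correlation rho; a student is
# trained online on task 1, then task 2, and forgetting is the rise in task-1 error.
# Widths, rates, and the similarity parametrisation are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, M, K, lr, steps = 50, 2, 4, 0.05, 30000

def teacher_pair(rho):
    """Two teachers whose hidden weight vectors have overlap rho."""
    W1 = rng.standard_normal((M, d))
    W2 = rho * W1 + np.sqrt(1 - rho**2) * rng.standard_normal((M, d))
    return W1, W2

def f(W, v, x):                                    # two-layer network with tanh hidden units
    return v @ np.tanh(W @ x / np.sqrt(d))

def task1_error(Ws, vs, Wt, n=500):
    xs = rng.standard_normal((n, d))
    return np.mean([(f(Ws, vs, x) - f(Wt, np.ones(M), x))**2 for x in xs]) / 2

W1, W2 = teacher_pair(rho=0.5)                     # intermediate similarity between tasks
Ws, vs = rng.standard_normal((K, d)) * 0.1, np.ones(K) / K

for task_W in (W1, W2):                            # train on task 1, then on task 2
    for _ in range(steps):
        x = rng.standard_normal(d)
        err = f(Ws, vs, x) - f(task_W, np.ones(M), x)
        h = np.tanh(Ws @ x / np.sqrt(d))
        vs -= lr * err * h
        Ws -= lr * err * np.outer(vs * (1 - h**2), x) / np.sqrt(d)
    print("task-1 generalisation error:", task1_error(Ws, vs, W1))

Sweeping rho in this sketch reproduces the kind of similarity-dependence of forgetting that the paper analyses.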
Singh, A. K., Ding, D., Saxe, A., Hill, F., & Lampinen, A. K. (2023). Know your audience: specializing grounded language models with listener subtraction. arXiv. http://arxiv.org/abs/2206.08349
Abstract
Effective communication requires adapting to the idiosyncrasies of each communicative context–such as the common ground shared with each partner. Humans demonstrate this ability to specialize to their audience in many contexts, such as the popular game Dixit. We take inspiration from Dixit to formulate a multi-agent image reference game where a (trained) speaker model is rewarded for describing a target image such that one (pretrained) listener model can correctly identify it among distractors, but another listener cannot. To adapt, the speaker must exploit differences in the knowledge it shares with the different listeners. We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization from rewards only, without direct supervision. Through controlled experiments, we show that training a speaker with two listeners that perceive differently, using our method, allows the speaker to adapt to the idiosyncrasies of the listeners. Furthermore, we show zero-shot transfer of the specialization to real-world data. Our experiments demonstrate a method for specializing grounded language models without direct supervision and highlight the interesting research challenges posed by complex multi-agent communication.
Lee, S., Goldt, S., & Saxe, A. (2021). Continual learning in the teacher-student setup: Impact of task similarity. International Conference on Machine Learning, 6109–6119. https://proceedings.mlr.press/v139/lee21e.html?ref=https://githubhelp.com
Saxe, A., Nelli, S., & Summerfield, C. (2021). If deep learning is the answer, what is the question? Nature Reviews Neuroscience, 22(1), 55–67. https://www.nature.com/articles/s41583-020-00395-8
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23), 11537–11546. https://doi.org/10.1073/pnas.1820226116
Abstract
Significance: Over the course of development, humans learn myriad facts about items in the world, and naturally group these items into useful categories and structures. This semantic knowledge is essential for diverse behaviors and inferences in adulthood. How is this richly structured semantic knowledge acquired, organized, deployed, and represented by neuronal networks in the brain? We address this question by studying how the nonlinear learning dynamics of deep linear networks acquires information about complex environmental structures. Our results show that this deep learning dynamics can self-organize emergent hidden representations in a manner that recapitulates many empirical phenomena in human semantic development. Such deep networks thus provide a mathematically tractable window into the development of internal neural representations through experience.
An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.
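The exact solutions underlying these results take a compact form. For the one-hidden-layer linear network with whitened inputs analysed in the paper, each singular value (mode strength) a_\alpha of the network's input-output map follows a sigmoidal trajectory

a_\alpha(t) \;=\; \frac{s_\alpha\, e^{2 s_\alpha t/\tau}}{e^{2 s_\alpha t/\tau} - 1 + s_\alpha / a_\alpha^{0}},

where s_\alpha is the corresponding singular value of the input-output correlation matrix, a_\alpha^{0} the small initial mode strength, and \tau the learning timescale. Stronger modes are learned earlier and in rapid, stage-like transitions, which is the mechanism behind the hierarchical differentiation of concepts described above.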
Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2019). Generalisation dynamics of online learning in over-parameterised neural networks. arXiv. http://arxiv.org/abs/1901.09085
Abstract
Deep neural networks achieve stellar generalisation on a variety of problems, despite often being large enough to easily fit all their training data. Here we study the generalisation dynamics of two-layer neural networks in a teacher-student setup, where one network, the student, is trained using stochastic gradient descent (SGD) on data generated by another network, called the teacher. We show how for this problem, the dynamics of SGD are captured by a set of differential equations. In particular, we demonstrate analytically that the generalisation error of the student increases linearly with the network size, with other relevant parameters held constant. Our results indicate that achieving good generalisation in neural networks depends on the interplay of at least the algorithm, its learning rate, the model architecture, and the data set.
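The differential-equation description mentioned here closes over a small set of "order parameters" rather than over individual weights. Schematically, with illustrative notation in which the student hidden weight vectors are w_k, the teacher's are \tilde{w}_m, and \alpha counts examples per input dimension N:

R_{km} = \frac{\mathbf{w}_k \cdot \tilde{\mathbf{w}}_m}{N}, \qquad
Q_{kl} = \frac{\mathbf{w}_k \cdot \mathbf{w}_l}{N}, \qquad
\epsilon_g = \epsilon_g(Q, R),

\frac{dR_{km}}{d\alpha} = f_R(Q, R), \qquad \frac{dQ_{kl}}{d\alpha} = f_Q(Q, R),

so the generalisation error and its dynamics depend on the high-dimensional weights only through these overlaps, which is what makes the behaviour of over-parameterised students (more student than teacher nodes) analytically tractable.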
Goldt, S., Advani, M., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2019). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. Advances in Neural Information Processing Systems, 32. https://proceedings.neurips.cc/paper_files/paper/2019/hash/cab070d53bd0d200746fb852a922064a-Abstract.html
Zhang, Y., Saxe, A. M., Advani, M. S., & Lee, A. A. (2018). Energy–entropy competition and the effectiveness of stochastic gradient descent in machine learning. Molecular Physics, 116(21-22), 3214–3223. https://doi.org/10.1080/00268976.2018.1483535
Nye, M., & Saxe, A. (2018). Are Efficient Deep Representations Learnable? arXiv. http://arxiv.org/abs/1807.06399
Abstract
Many theories of deep learning have shown that a deep network can require dramatically fewer resources to represent a given function compared to a shallow network. But a question remains: can these efficient representations be learned using current deep learning techniques? In this work, we test whether standard deep learning methods can in fact find the efficient representations posited by several theories of deep representation. Specifically, we train deep neural networks to learn two simple functions with known efficient solutions: the parity function and the fast Fourier transform. We find that using gradient-based optimization, a deep network does not learn the parity function, unless initialized very close to a hand-coded exact solution. We also find that a deep linear neural network does not learn the fast Fourier transform, even in the best-case scenario of infinite training data, unless the weights are initialized very close to the exact hand-coded solution. Our results suggest that not every element of the class of compositional functions can be learned efficiently by a deep network, and further restrictions are necessary to understand what functions are both efficiently representable and learnable.
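The parity experiment is easy to reproduce in miniature. A sketch, not the paper's architecture or scale, using scikit-learn for brevity:

# Train a small MLP on n-bit parity from random initialisation and check whether it
# generalises to held-out bit strings. Architecture and sizes are illustrative.
import itertools
import numpy as np
from sklearn.neural_network import MLPClassifier

n_bits = 10
X = np.array(list(itertools.product([0, 1], repeat=n_bits)))   # all 2^n bit strings
y = X.sum(axis=1) % 2                                          # parity labels
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
train, test = idx[:768], idx[768:]

net = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=3000, random_state=0)
net.fit(X[train], y[train])
print("accuracy on unseen bit strings:", net.score(X[test], y[test]))  # near 0.5 suggests parity was not learned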
Earle, A. C., Saxe, A. M., & Rosman, B. (2017). Hierarchical Subtask Discovery With Non-Negative Matrix Factorization. arXiv. http://arxiv.org/abs/1708.00463
Abstract
Hierarchical reinforcement learning methods offer a powerful means of planning flexible behavior in complicated domains. However, learning an appropriate hierarchical decomposition of a domain into subtasks remains a substantial challenge. We present a novel algorithm for subtask discovery, based on the recently introduced multitask linearly-solvable Markov decision process (MLMDP) framework. The MLMDP can perform never-before-seen tasks by representing them as a linear combination of a previously learned basis set of tasks. In this setting, the subtask discovery problem can naturally be posed as finding an optimal low-rank approximation of the set of tasks the agent will face in a domain. We use non-negative matrix factorization to discover this minimal basis set of tasks, and show that the technique learns intuitive decompositions in a variety of domains. Our method has several qualitatively desirable features: it is not limited to learning subtasks with single goal states, instead learning distributed patterns of preferred states; it learns qualitatively different hierarchical decompositions in the same domain depending on the ensemble of tasks the agent will face; and it may be straightforwardly iterated to obtain deeper hierarchical decompositions.
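The core computation, a low-rank non-negative factorisation of a tasks-by-states matrix into a small basis of "subtasks", can be sketched with scikit-learn; the matrix below is a random stand-in rather than an MLMDP desirability matrix from the paper:

# Factorise a (tasks x states) non-negative matrix into a small non-negative subtask basis.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_tasks, n_states, n_subtasks = 20, 100, 4
Z = rng.random((n_tasks, n_states))                    # rows: tasks, columns: states (stand-in data)

model = NMF(n_components=n_subtasks, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(Z)                             # task-to-subtask mixing weights
H = model.components_                                  # subtasks: distributed patterns over states
print("relative reconstruction error:", np.linalg.norm(Z - W @ H) / np.linalg.norm(Z))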
McClelland, J. L., Sadeghi, Z., & Saxe, A. M. (2016). A Critique of Pure Hierarchy: Uncovering Cross-Cutting Structure in a Natural Dataset. Neurocomputational Models of Cognitive Development and Processing, 51–68. https://doi.org/10.1142/9789814699341_0004
Saxe, A. M., Earle, A., & Rosman, B. (2016). Hierarchy through Composition with Linearly Solvable Markov Decision Processes. arXiv. http://arxiv.org/abs/1612.02757
Abstract
Hierarchical architectures are critical to the scalability of reinforcement learning methods. Current hierarchical frameworks execute actions serially, with macro-actions comprising sequences of primitive actions. We propose a novel alternative to these control hierarchies based on concurrent execution of many actions in parallel. Our scheme uses the concurrent compositionality provided by the linearly solvable Markov decision process (LMDP) framework, which naturally enables a learning agent to draw on several macro-actions simultaneously to solve new tasks. We introduce the Multitask LMDP module, which maintains a parallel distributed representation of tasks and may be stacked to form deep hierarchies abstracted in space and time.
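The compositionality exploited by the Multitask LMDP module comes from the linearity of the LMDP Bellman equation in the desirability function. Schematically, in the first-exit formulation (standard LMDP notation; a sketch rather than the paper's exact statement):

z_i \;=\; e^{-q_i/\lambda} \sum_{j} p(j \mid i)\, z_j, \qquad z_i \;=\; e^{-v_i/\lambda},

where q_i is the state cost, p(j|i) the passive dynamics, and v_i the optimal cost-to-go. Because this equation is linear in z given the boundary (terminal) values, a new task whose exponentiated terminal rewards are a weighted combination of those in a learned task library is solved by the same weighted combination of the library's desirability functions: stacking the library as columns of a matrix Z, the module solves new tasks as z = Z w.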
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv. http://arxiv.org/abs/1312.6120
Abstract
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
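One practical take-away, random orthogonal initialisation, is easy to sketch: orthogonalise a Gaussian draw via QR so that the product of many layer matrices stays well-conditioned, unlike a product of scaled Gaussian matrices. Sizes and depth below are illustrative:

# Compare the spectrum of a deep product of weight matrices under orthogonal vs Gaussian init.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 200, 50

def orthogonal(n):
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))                 # sign fix so the draw is uniform (Haar)

for init in ("orthogonal", "gaussian"):
    prod = np.eye(n)
    for _ in range(depth):
        W = orthogonal(n) if init == "orthogonal" else rng.standard_normal((n, n)) / np.sqrt(n)
        prod = W @ prod
    s = np.linalg.svd(prod, compute_uv=False)
    print(f"{init:10s} singular values of the product: max {s.max():.2e}, min {s.min():.2e}")

The orthogonal product is an isometry (all singular values equal to one), while the Gaussian product becomes exponentially ill-conditioned with depth, which is the faithful-gradient-propagation point made in the abstract.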
Monajemi, H., Jafarpour, S., Gavish, M., Stat 330/CME 362 Collaboration, Donoho, D. L., Ambikasaran, S., Bacallado, S., Bharadia, D., Chen, Y., Choi, Y., Chowdhury, M., Chowdhury, S., Damle, A., Fithian, W., Goetz, G., Grosenick, L., Gross, S., Hills, G., Hornstein, M., … Zhu, Z. (2013). Deterministic matrices matching the compressed sensing phase transitions of Gaussian random matrices. Proceedings of the National Academy of Sciences, 110(4), 1181–1186. https://doi.org/10.1073/pnas.1219540110
Abstract
In compressed sensing, one takes n < N samples of an N-dimensional vector x_0 using an n × N matrix A, obtaining undersampled measurements y = A x_0. For random matrices with independent standard Gaussian entries, it is known that, when x_0 is k-sparse, there is a precisely determined phase transition: for a certain region in the (δ, ρ)-phase diagram (with undersampling ratio δ = n/N and sparsity ratio ρ = k/n), convex optimization typically finds the sparsest solution, whereas outside that region, it typically fails. It has been shown empirically that the same property—with the same phase transition location—holds for a wide range of non-Gaussian random matrix ensembles. We report extensive experiments showing that the Gaussian phase transition also describes numerous deterministic matrices, including Spikes and Sines, Spikes and Noiselets, Paley Frames, Delsarte-Goethals Frames, Chirp Sensing Matrices, and Grassmannian Frames. Namely, for each of these deterministic matrices in turn, for a typical k-sparse object, we observe that convex optimization is successful over a region of the phase diagram that coincides with the region known for Gaussian random matrices. Our experiments considered coefficients constrained to a set X, for four different sets X, and the results establish our finding for each of the four associated phase transitions.
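One cell of such a phase-diagram experiment can be sketched as an l1-minimisation (basis pursuit) solve via linear programming; a Gaussian matrix is used below for brevity rather than one of the deterministic ensembles studied in the paper:

# Draw a k-sparse non-negative x0, take n < N measurements y = A x0, and test whether
# l1 minimisation (cast as a linear program) recovers it.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, n, k = 200, 100, 20
A = rng.standard_normal((n, N)) / np.sqrt(n)
x0 = np.zeros(N); x0[rng.choice(N, k, replace=False)] = rng.random(k) + 0.5   # k-sparse, non-negative
y = A @ x0

# min 1'x  s.t.  A x = y, x >= 0  (equivalent to l1 minimisation for non-negative signals)
res = linprog(c=np.ones(N), A_eq=A, b_eq=y, bounds=[(0, None)] * N, method="highs")
print("recovered:", res.success and np.allclose(res.x, x0, atol=1e-5))

Repeating this over a grid of (δ, ρ) values and many random sparse objects traces out the empirical phase transition described above.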
Furlanello, T., Zhao, J., Saxe, A. M., Itti, L., & Tjan, B. S. (2016). Active Long Term Memory Networks. arXiv. http://arxiv.org/abs/1606.02355
Abstract
Continual Learning in artificial neural networks suffers from interference and forgetting when different tasks are learned sequentially. This paper introduces the Active Long Term Memory Networks (A-LTM), a model of sequential multi-task deep learning that is able to maintain previously learned associations between sensory input and behavioral output while acquiring new knowledge. A-LTM exploits the non-convex nature of deep neural networks and actively maintains knowledge of previously learned, inactive tasks using a distillation loss. Distortions of the learned input-output map are penalized but hidden layers are free to traverse towards new local optima that are more favorable for the multi-task objective. We re-frame McClelland’s seminal hippocampal theory with respect to Catastrophic Inference (CI) behavior exhibited by modern deep architectures trained with back-propagation and inhomogeneous sampling of latent factors across epochs. We present empirical results of non-trivial CI during continual learning in Deep Linear Networks trained on the same task, in Convolutional Neural Networks when the task shifts from predicting semantic to graphical factors and during domain adaptation from simple to complex environments. We present results of the A-LTM model’s ability to maintain viewpoint recognition learned in the highly controlled iLab-20M dataset with 10 object categories and 88 camera viewpoints, while adapting to the unstructured domain of Imagenet with 1,000 object categories.
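The active-maintenance idea, penalising drift of the old task's input-output map with a distillation term while learning a new task, can be sketched with a combined gradient; the linear model, data, and weighting below are illustrative stand-ins, not the A-LTM architecture:

# While training on a new task, add a distillation penalty keeping outputs on old-task
# inputs close to those of a frozen copy made before the task switch.
import numpy as np

rng = np.random.default_rng(0)
d, lr, lam = 20, 0.05, 5.0
X_old, X_new = rng.standard_normal((500, d)), rng.standard_normal((500, d))
y_old = X_old @ rng.standard_normal(d)
y_new = X_new @ rng.standard_normal(d)

w = np.linalg.lstsq(X_old, y_old, rcond=None)[0]       # train on the old task first
w_frozen = w.copy()                                    # frozen copy acts as the distillation teacher

for _ in range(2000):                                  # learn the new task with the distillation term
    grad_new = X_new.T @ (X_new @ w - y_new) / len(X_new)
    grad_distill = X_old.T @ (X_old @ w - X_old @ w_frozen) / len(X_old)
    w -= lr * (grad_new + lam * grad_distill)

print("drift on old-task outputs:", np.mean((X_old @ w - X_old @ w_frozen) ** 2))
print("new-task error:", np.mean((X_new @ w - y_new) ** 2))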
Goodfellow, I. J., Vinyals, O., & Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. arXiv. http://arxiv.org/abs/1412.6544
Abstract
Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
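The analysis technique is simply evaluating the training loss along a straight line in parameter space between the initial and final weights. A minimal sketch with a tiny network and toy task (both illustrative, not those used in the paper):

# Evaluate loss along theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 2))
y = np.sign(X[:, 0] * X[:, 1])                       # simple non-linearly-separable (XOR-like) task

def unpack(theta):                                   # 2-8-1 tanh network, parameters as one flat vector
    W1 = theta[:16].reshape(8, 2); b1 = theta[16:24]
    w2 = theta[24:32]; b2 = theta[32]
    return W1, b1, w2, b2

def loss(theta):
    W1, b1, w2, b2 = unpack(theta)
    h = np.tanh(X @ W1.T + b1)
    return np.mean((h @ w2 + b2 - y) ** 2)

def grad(theta, eps=1e-5):                           # numerical gradient keeps the sketch short
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

theta_init = rng.standard_normal(33) * 0.5
theta = theta_init.copy()
for _ in range(2000):                                # plain gradient descent to a solution
    theta -= 0.1 * grad(theta)
theta_final = theta

for alpha in np.linspace(0, 1, 11):                  # loss along the straight path is typically smooth
    print(f"alpha {alpha:.1f}  loss {loss((1 - alpha) * theta_init + alpha * theta_final):.4f}")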
Saxe, A., Bhand, M., Mudur, R., Suresh, B., & Ng, A. (2011). Modeling cortical representational plasticity with unsupervised feature learning. Poster Presented at COSYNE, 24–27. http://bipinsuresh.info/papers/ModelingCorticalRepresentationalPlasticityWithUnsupervisedFeatureLearning.pdf
Balci, F., Simen, P., Niyogi, R., Saxe, A., Hughes, J. A., Holmes, P., & Cohen, J. D. (2011). Acquisition of decision making criteria: reward rate ultimately beats accuracy. Attention, Perception, & Psychophysics, 73(2), 640–657. https://doi.org/10.3758/s13414-010-0049-7
Saxe, A. M. (2013). Precis of deep linear neural networks: A theory of learning in the brain and mind. https://cognitivesciencesociety.org/wp-content/uploads/2019/01/SaxePrecis.pdf
Advani, M. S., Saxe, A. M., & Sompolinsky, H. (2020). High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132, 428–446. https://www.sciencedirect.com/science/article/pii/S0893608020303117