This workshop aims to bring together researchers in stochastic analysis, statistics and theoretical machine learning for an exchange of ideas at the forefront of the field. The workshop will coincide with the visit of Professor Gerard Ben Arous, a leading expert in stochastic analysis and high-dimensional statistics, whose insights into deep learning theory offer an exceptional opportunity for meaningful collaborations. The event will feature a series of presentations and discussions around the mathematical underpinnings of modern machine learning techniques, including topics such as:
- Theoretical analysis of deep learning architectures
- High-dimensional statistics and learning theory
- Diffusion models
- Connections between stochastic differential equations and neural networks
Confirmed speakers
- (Courant Institute, NYU)
- (DeepMind)
- (51勛圖厙)
- (51勛圖厙)
- (51勛圖厙)
- (51勛圖厙)
- (Bath)
- (51勛圖厙)
- (Oxford)
Schedule
9.15-9.30 Registration and Welcome
9.30-10.00 Andrew Duncan
10.00-10.30 Nicola M. Cirone
10.30-11.00 Deniz Akyildiz
11.00-11.30 Coffee break
11.30-12.15 Gerard Ben Arous
12.15-12.45 Will Turner
12.45-14.00 Lunch @ The Works, Sir Michael Uren Building, London W12 0BZ (by invitation only)
14.00-14.45 Arnaud Doucet
14.45-15.15 James Foster
15.15-15.45 Coffee break
15.45-16.30 Harald Oberhauser
16.30-17.00 Yingzhen Li
18.30 Conference dinner @ The Broadcaster, 89 Wood Ln, London W12 7FX (by invitation only)
Titles and abstracts
Title: Effective dynamics for summary statistics in high dimensional optimization: can the spectral point of view be sharp?
Abstract: I will present a survey of recent progress on the notions of summary statistics and effective dynamics for the natural optimization tasks arising in high-dimensional data science and machine learning. The main idea is that in many very high-dimensional problems, most of the action happens locally in a low-dimensional space. The projections onto these spaces (the summary statistics) follow (possibly complex) autonomous dynamics (the effective dynamics), which carry all the information about the success of the optimization task. The hard part is often to find these summary statistics, and here a dynamical spectral approach is useful, as the Hessian (and the Fisher information matrix) develop a spectral BBP transition along the training process when the signal-to-noise ratio is strong enough. I will illustrate this with a brand new result about a sharp dynamical spectral transition in a central example of ML, namely multilayer neural networks for classification tasks. These ideas were introduced in a series of works with Reza Gheissari and Aukosh Jagannath, and the spectral approach was then developed jointly with Jiaoyang Huang. If time permits, I will also show how these effective dynamics work for multi-spike tensor PCA (taken from a series of recent joint works with Cedric Gerbelot and Vanessa Piccolo). There, the spectral transition is yet to be studied.
Title: Accelerated Diffusion Models via Speculative Sampling
Abstract: Speculative sampling is a popular technique for accelerating inference in Large Language Models (LLMs) by generating candidate tokens using a fast draft model and accepting or rejecting them based on the target model's distribution. While speculative sampling was previously limited to discrete sequences, we extend it to diffusion models, which generate samples via continuous, vector-valued Markov chains. In this context, the target model is a high-quality but computationally expensive diffusion model. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model and is applicable out of the box to any diffusion model. Our experiments demonstrate significant generation speedup on various diffusion models while generating exact samples from the target model.
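For readers unfamiliar with the accept/reject step mentioned in this abstract, the following is a minimal sketch of the standard discrete (token-level) version only. It is not the diffusion-model extension presented in the talk. The function names `speculative_step` and `output_distribution`, and the toy distributions `p` (target) and `q` (draft), are hypothetical and chosen for illustration.

```python
import random


def speculative_step(p, q, rng):
    """One accept/reject step for discrete distributions (illustrative only).

    p: target probabilities, q: draft probabilities (both sum to 1).
    Returns (sample, accepted_flag).
    """
    # Propose a candidate from the cheap draft distribution q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalised residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def output_distribution(p, q):
    """Exact law of the value returned by speculative_step, computed analytically."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]          # mass accepted at each x
    reject_mass = 1.0 - sum(accept)                          # total rejection probability
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:                                         # p == q: never rejected
        return accept
    return [a + reject_mass * r / total for a, r in zip(accept, residual)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]   # hypothetical target distribution
    q = [0.2, 0.2, 0.6]   # hypothetical draft distribution
    print(output_distribution(p, q))
    print(speculative_step(p, q, random.Random(0)))
```

The analytic check in `output_distribution` illustrates why the scheme is exact: the accepted mass min(p, q) plus the redistributed rejected mass max(p - q, 0) recovers p at every outcome.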
Title: Diffusion-based Learning of Latent Variable Models
Abstract: In this talk, I will summarize recent progress and challenges in maximum marginal likelihood estimation (MMLE) for learning latent variable models (LVMs), focusing on methods based on Langevin diffusions. I will first introduce the problem and the necessary background on Langevin diffusions, together with recent results on Langevin-based MMLE estimators, detailing the interacting particle Langevin algorithm (IPLA), a recent Langevin-based MMLE method with explicit theoretical guarantees akin to those of Langevin Monte Carlo methods. I will then outline recent progress, specifically accelerated variants and methods for MMLE in nondifferentiable statistical models, with convergence and complexity results. Finally, if time permits, I will talk about the application of IPLA to inverse problems.
Title: Computable Statistical Divergences for Functional Data
Abstract: Kernel-based discrepancies have found considerable success in constructing statistical tests which are now widely used in statistical machine learning. Examples include the Kernel Stein Discrepancy, which enables goodness-of-fit tests of data samples against an (unnormalized) probability density based on Stein's method. The effectiveness of the associated tests depends crucially on the dimension of the data. I will present some recent results on the behaviour of such tests in high dimensions, exploring properties of the statistical divergence under different scalings of data dimension and data size. Building on this, I will discuss how such discrepancies can be extended to probability distributions on infinite-dimensional spaces. I will discuss applications to goodness-of-fit testing for measures on function spaces and their relevance to various problems in ML.
Title: Efficient, Accurate and Stable Gradients for Neural Differential Equations
Abstract: Neural differential equations (NDEs) sit at the intersection of two dominant modelling paradigms: neural networks and differential equations. One of their features is that they can be trained with a small memory footprint through adjoint equations. This can be helpful in high-dimensional applications since the memory usage of standard backpropagation scales linearly with depth (or, in the NDE case, the number of steps taken by the solver). However, adjoint equations have seen little use in practice as the resulting gradients are often inaccurate. Fortunately, there has emerged a class of numerical methods which allow NDEs to be trained using gradients that are both accurate and memory efficient. These solvers are known as algebraically reversible and produce numerical solutions which can be reconstructed backwards in time. Whilst algebraically reversible solvers have seen some success in large-scale applications, they are known to have stability issues. In this talk, we propose a methodology for constructing reversible NDE solvers from non-reversible ones. We show that the resulting reversible solvers converge in the ODE setting, can achieve high order convergence, and even have stability regions. We conclude with a few examples demonstrating the memory efficiency of our approach. Joint work with Samuel McCallum.
Title: Graph Expansions of Deep Neural Networks and their Universal Scaling Limits
Abstract: We present a unified approach to obtain scaling limits of neural networks using the genus expansion technique from random matrix theory. This approach begins with a novel expansion of neural networks which is reminiscent of Butcher series for ODEs, and is obtained through a generalisation of Faà di Bruno's formula to an arbitrary number of compositions. In this expansion, the role of monomials is played by random multilinear maps indexed by directed graphs whose edges correspond to random matrices, which we call operator graphs. This expansion linearises the effect of the activation functions, allowing for the direct application of Wick's principle to compute the expectation of each of its terms. We then determine the leading contribution to each term by embedding the corresponding graphs onto surfaces, and computing their Euler characteristic. Furthermore, by developing a correspondence between analytic and graphical operations, we obtain similar graph expansions for the neural tangent kernel as well as the input-output Jacobian of the original neural network, and derive their infinite-width limits with relative ease. Notably, we find explicit formulae for the moments of the limiting singular value distribution of the Jacobian. We then show that all of these results hold for networks with more general weights, such as general matrices with i.i.d. entries satisfying moment assumptions, complex matrices and sparse matrices.
Title: Randomised path developments, and signature kernels as universal scaling limits
Abstract: Scaling limits of random developments of a path into a matrix Lie group have recently been used to construct signature-based time series kernels. General linear group developments have been shown to be connected to the ordinary signature kernel (Muça Cirone et al.), while unitary developments have been used to construct the path characteristic function distance (Lou et al.), which has proven a successful discriminator for generative modelling tasks. By leveraging the tools of random matrix theory and free probability theory, we are able to provide a unified treatment of both limits under general assumptions on the randomisation. For unitary developments, we show that the limiting kernel is given by the contraction of a signature against the monomials of freely independent semicircular random variables. Using the Schwinger-Dyson equations, we show that this kernel can be obtained by solving a novel quadratic functional equation. We will also discuss extensions to a class of Hermitian matrix models, whose limiting Schwinger-Dyson equations lead to path-dependent functional equations. Joint work with Thomas Cass, Samuel Crew and Cristopher Salvi.