While I'm interested in the topic of the post and have seen plenty of visualisations of balls rolling around hills, I was a little disappointed that it didn't cover the thing that has been bugging me for years.
Momentum, or specifically inertia, in physics: what the hell is it? There's a Feynman tale where he asked his father why the ball rolled to the back of a trolley when he pulled the trolley. The answer he received was the usual description of inertia, but also the rarely given insight that describing something and giving it a name is completely different from knowing why it happens.
It's one of those things that I lie in bed thinking about. The other one is position: I can grasp the notion of spacetime and the idea of movement and speed as changes in position in space relative to position in time. I really don't have a grasp of what position is, though. I know the name, and I can attach numbers to it, but that doesn't really cover what the numbers are of.
What specifically do you feel you don't grok about inertia? I'll admit the use of "inertia" for explaining phenomena historically bothered me, as it seemed like just an extra word for something already covered. Inertia/momentum describes what an object will do in the next instant if nothing else "happens" to the object. Force describes deviation from this according to dp/dt = F. Of course, this is in the classical sense.
I'm not sure about position. It's a hard one to think about. What's important is that position's numbers (coordinates) are defined according to a coordinate system, but the actual physical "position" doesn't care about the coordinate system. So things like distance, or the time it takes to get from one point to another (in some units), are invariant under coordinate changes.
What’s hard to understand about position? Isn’t it just a specific coordinate in some space?
I was curious how well the simple momentum step-size approach shown in the first interactive example compares to alternative methods. The example function featured in the first interactive example is named bananaf (the "Rosenbrock banana function"), defined as

var s = 3
var x = xy[0]; var y = xy[1]*s
var fx = (1-x)*(1-x) + 20*(y - x*x)*(y - x*x)
var dfx = [-2*(1-x) - 80*x*(-x*x + y), s*40*(-x*x + y)]
The interactive example uses an initial guess of [-1.21, 0.853] and a fixed 150 iterations, with no convergence test.

From manually fiddling with the (step-size) alpha & (momentum) beta parameters, and editing the code to specify a smaller number of iterations, it seems quite difficult to tune this momentum-based approach to get near the minimum and stay there without bouncing away in 50 iterations or fewer.
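For concreteness, here's a minimal Python sketch of that fixed-step momentum loop on bananaf (my transcription of the article's JS; the alpha & beta values below are just illustrative, not the widget's exact defaults):

import numpy as np

def bananaf(xy):
    # transcription of the article's JS bananaf and its gradient
    s = 3.0
    x, y = xy[0], xy[1] * s
    fx = (1 - x)**2 + 20 * (y - x**2)**2
    dfx = np.array([-2 * (1 - x) - 80 * x * (y - x**2), s * 40 * (y - x**2)])
    return fx, dfx

alpha, beta = 0.003, 0.8          # step size & momentum; tune as discussed above
w = np.array([-1.21, 0.853])      # initial guess from the interactive example
z = np.zeros(2)
for _ in range(150):              # fixed 150 iterations, no convergence test
    _, g = bananaf(w)
    z = beta * z + g              # accumulate gradients, decayed by beta
    w = w - alpha * z             # step along the accumulated direction
print(w, bananaf(w)[0])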
Out of curiosity, I compared minimising this bananaf function with scipy.optimize.minimize, using the same initial guess.
If we force scipy.optimize.minimize to use method='cg', leaving all other parameters as defaults, it converges to the optimal solution of [1.0, 1./3.], requiring 43 evaluations of fx and dfx.
If we allow scipy.optimize.minimize to use all defaults (including the default method='bfgs'), it converges to the optimal solution after only 34 evaluations of fx and dfx.
Under the hood, scipy's method='cg' and method='bfgs' solvers do not use a fixed step size or momentum to determine the step size, but instead solve a line search problem. The line search problem is to identify a step size that satisfies a sufficient decrease condition and a curvature condition; see the Wolfe conditions [1]. Scipy's default line search method -- used for cg and bfgs -- is a Python port [2] of the dcsrch routine from MINPACK2. A good reference covering line search methods & BFGS is Nocedal & Wright's 2006 book Numerical Optimization.
[1] https://en.wikipedia.org/wiki/Wolfe_conditions [2] https://github.com/scipy/scipy/blob/main/scipy/optimize/_dcs...
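If anyone wants to reproduce the comparison, here's a minimal sketch (assuming scipy is installed; exact evaluation counts may differ slightly across scipy versions):

import numpy as np
from scipy.optimize import minimize

def fx(xy):
    s = 3.0
    x, y = xy[0], xy[1] * s
    return (1 - x)**2 + 20 * (y - x**2)**2

def dfx(xy):
    s = 3.0
    x, y = xy[0], xy[1] * s
    return np.array([-2 * (1 - x) - 80 * x * (y - x**2), s * 40 * (y - x**2)])

x0 = np.array([-1.21, 0.853])
for method in ('CG', 'BFGS'):  # both use the dcsrch-based line search by default
    res = minimize(fx, x0, jac=dfx, method=method)
    print(method, res.x, res.nfev, res.njev)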
Now try the same experiment in 1 billion dimensions.
It's unclear whether increasing the dimensionality is in itself a challenge, provided that the objective function is still convex with a unique global minimum -- like these somewhat problematic Rosenbrock test objective functions used in the article's examples.
On the other hand, if the objective function is very multimodal with many "red herring" local minima, perhaps an optimiser that is very good at finding a local minimum might do worse in practice at globally optimising than an optimiser that sometimes "barrels" out of the basin of a local minimum and accidentally falls into a neighbouring basin around a lower minimum.
I ran a few numerical experiments using scipy's "rosen" test function [1] as the objective, in D=10,000 dimensions. This function has a unique global minimum of 0, attained at x* = 1_D. I set the initial guess as x0 := x* + eps, where each element eps_i, i = 1, ..., D, is noise sampled from N(0, 0.05).
Repeating this over 100 trial problems, using the same initial guess x0 across each method during each trial, the average number of gradient evaluations required for convergence was:

'cg': 248
'l-bfgs-b': 40
'm-001-99': 3337

All methods converged in 100 / 100 trials. m-001-99 is gradient descent with momentum using alpha=0.001 and beta=0.99; setting alpha=0.002 or higher causes momentum to fail to converge. The other two methods are scipy's cg & l-bfgs-b methods using default parameters (again, under the hood these two methods rely on a port of MINPACK2's dcsrch to determine the step size along the descent direction during each iteration; they're not using momentum updates or a fixed step size). I used l-bfgs-b instead of bfgs to avoid maintaining the dense DxD matrix for the approximate inverse Hessian.
One point in momentum's favour was robustness to higher noise levels in the initial guess: if the noise used to define x0 is increased to N(0, 1), I see the cg & l-bfgs-b methods fail to converge in around 20% of trial problems, while momentum fails a lower fraction of the time provided the fixed step size is set small enough, though it still requires a very large number of gradient evaluations to converge.
[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.o...
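A minimal sketch of one trial of this experiment (assuming scipy's rosen/rosen_der; the gradient-norm stopping rule and evaluation budget shown here are illustrative choices):

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

D = 10_000
x_star = np.ones(D)                           # unique global minimizer; rosen(x_star) == 0
rng = np.random.default_rng(0)
x0 = x_star + rng.normal(0.0, 0.05, size=D)   # N(0, 0.05) perturbation, as above

for method in ('CG', 'L-BFGS-B'):             # defaults; step sizes come from line search
    res = minimize(rosen, x0, jac=rosen_der, method=method)
    print(method, res.fun, res.njev)

# gradient descent with momentum, fixed step size (the 'm-001-99' method)
alpha, beta = 0.001, 0.99
x, z, n_grad = x0.copy(), np.zeros(D), 0
while n_grad < 100_000:                       # evaluation budget
    g = rosen_der(x)
    n_grad += 1
    if np.linalg.norm(g) < 1e-5:              # illustrative convergence test
        break
    z = beta * z + g
    x = x - alpha * z
print('momentum', rosen(x), n_grad)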
Discussed at the time:
Why Momentum Works - https://news.ycombinator.com/item?id=14034426 - April 2017 (95 comments)
Perhaps it is an elementary question, but does it all apply to rotational motion? Does a wheel rotating about its own axis continue to rotate in perpetuum, in the absence of friction, air resistance, etc.?
Distill.pub has such high quality content consistently. It's a shame they don't seem to be active anymore.
I agree. Just the use of animations for explanations was a huge step forward. I wonder why the flagship ML/AI conferences have not adopted the Distill digital template for papers yet. I think that would be the first step; the quality would follow.
The quality would not follow because Distill.pub publications take literally hundreds of man-hours for the Distill part. Highly skilled man-hours too, to make any of this work on the Web reliably. (Source: I once asked Olah basically the same question: "How many man-hours of work by Google-tier web developers could those visualizations possibly be, Michael? 10?")
I've been wondering at what point AI assistants are going to reduce that to a manageable level. It's unfortunately not obvious what the main bottlenecks are, though Chris and Shan might have a good sense.
It might be doable soon, you're right. But there seems to be a substantial weakness in vision-language models, where they have a bad time with anything involving screenshots, tables, schematics, or visualizations, compared to real-world photographs. (This is also, I'd guess, partially why Claude/Gemini do so badly on Pokemon screenshots without a lot of hand-engineering. Abstract pixel art in a structured UI may be a sort of worst-case scenario for whatever it is they do.) So that makes it hard to do any kind of feedback, never mind letting them try to code interactive visualization stuff autonomously.
A few comments on this thread:
Gwern is correct in his quote above about how long these articles took. I think 50-200 hours is a pretty good range.
I expect AI assistants could help quite a bit with implementing the interactive diagrams, which was a significant fraction of this time. This is especially true for authors without a background in web development.
However, a huge amount of the editorial time went into other things. This article was a best case scenario for an article not written by the editors themselves. Gabriel is phenomenal and was a delight to work with. The editors didn't write any code for this article that I remember. But we still spent many tens of hours giving feedback on the text and diagrams. You can see some of this on GitHub, e.g. https://github.com/distillpub/post--momentum/issues?q=is%3Ai...
More broadly, we struggled a lot with procedural issues. (We wrote a bit about this here: https://distill.pub/2021/distill-hiatus/) In retrospect, I deeply regret trying to run Distill with the expectations of a scientific journal, rather than the freedom of a blog, or wish I'd pushed back more on process. Not only did it occupy enormous amounts of time and energy, but it was just very de-energizing. I wanted to spend my time writing great articles and helping people write great articles.
(I was recently reading Thompson & Klein's Abundance, and kept thinking back to my experiences with Distill.)
Only skimmed through the article for now, but I have to give props to the author - it's beautifully made.
Geez. What a dithering article.