When you write source code, you access the library through an API. Once the code is compiled, your application accesses the binary data in the library through the ABI. The ABI defines the structures and methods that your compiled application will use to access the external library (just like the API did), only at a lower level.
An oversimplified summary:
API: "Here are all the functions you may call."
ABI: "This is how to call a function."
Your API defines the order in which you pass arguments to a function. Your ABI defines the mechanics of how these arguments are passed (registers, stack, etc.).
Linux and Windows use different ABIs, so a Windows program won't know how to access a library compiled for Linux.
If the ABI changes but the API does not, then the old and new library versions are sometimes called "source compatible": existing source code still compiles against the new version, but existing binaries are not compatible with it.
Keeping an ABI stable means not changing function interfaces (return type and number, types, and order of arguments), definitions of data types or data structures, defined constants, etc. New functions and data types can be added, but existing ones must stay the same. If, for instance, your library uses 32-bit integers to indicate the offset of a function and you switch to 64-bit integers, then already-compiled code that uses that library will not be accessing that field (or any field following it) correctly.
Also, if you have a 64-bit OS that can execute 32-bit binaries, you will have different ABIs for 32-bit and 64-bit code.
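To make the field-width example above concrete, here is a minimal Python sketch (the struct and field names are invented, and ctypes is used purely to show memory layouts) of how widening one field from 32 to 64 bits shifts the offsets that already-compiled callers have baked in:

```python
import ctypes

# Hypothetical record exposed through a library's ABI: a 32-bit offset
# followed by a flags field. (Names are invented for illustration.)
class RecordV1(ctypes.Structure):
    _fields_ = [("offset", ctypes.c_int32),
                ("flags",  ctypes.c_uint32)]

# The "same" record after widening offset to 64 bits. The API (field
# names and their meaning) is unchanged, but the binary layout is not.
class RecordV2(ctypes.Structure):
    _fields_ = [("offset", ctypes.c_int64),
                ("flags",  ctypes.c_uint32)]

# Code compiled against V1 expects flags at byte 4; in V2 it sits at
# byte 8, so an old binary handed a V2 record would read garbage there.
print(RecordV1.flags.offset, ctypes.sizeof(RecordV1))   # 4 8
print(RecordV2.flags.offset, ctypes.sizeof(RecordV2))   # 8 16
```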
The technology we have today, and the standard of living it supports, is like a massive exercise in crowd-sourcing.
One of the most important examples is GPS; before that we had celestial navigation. How were we able to create a corpus of star observations so large, yet so accurate, across the globe? More context about the thought here:
One question I’ve often asked is:
Why do we have so many neural network optimizers?
Why Adam, SGD, Momentum, RMSProp, AdamW, Adagrad…
Why not just pick one and use it everywhere?
Surprisingly, the best way to understand this is to look at something much older and much more mature:
Numerical Linear Algebra.
If you’ve ever studied how we solve linear systems or least squares problems — LU, QR, Cholesky, Conjugate Gradient, GMRES, etc. — the pattern is exactly the same:
Different landscapes require different algorithms because stability, conditioning, and convergence behavior vary.
This post is a clean summary of that analogy — the one that helped me deeply understand optimizers in ML.
Modern neural networks are trained by minimizing a loss function:
$$\min_\theta f(\theta)$$
This is a numerical optimization problem with millions of variables.
And like all numerical problems, the main enemies are:
Ill-conditioning
Sensitivity to step size
Stability of the update rule
Convergence speed
These issues are exactly the same ones faced in linear algebra when solving:
$$Ax = b$$
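To see what "ill-conditioning" means in practice, here is a small NumPy sketch (the matrices are purely illustrative) comparing how much a tiny perturbation of $b$ moves the solution $x$ for a well-conditioned versus an ill-conditioned system:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
b = rng.standard_normal(n)
db = 1e-8 * rng.standard_normal(n)   # tiny perturbation of the right-hand side

# A well-conditioned system and a classically ill-conditioned one (Hilbert matrix).
A_good = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A_bad = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

for name, A in [("well-conditioned", A_good), ("ill-conditioned", A_bad)]:
    x = np.linalg.solve(A, b)
    x_perturbed = np.linalg.solve(A, b + db)
    rel_in = np.linalg.norm(db) / np.linalg.norm(b)
    rel_out = np.linalg.norm(x_perturbed - x) / np.linalg.norm(x)
    print(f"{name}: cond(A) = {np.linalg.cond(A):.1e}, "
          f"error amplification = {rel_out / rel_in:.1e}")
```

The amplification factor is bounded by cond(A): when the condition number is huge, even roundoff-sized noise in $b$ visibly corrupts $x$.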
Why do linear algebraists not use one universal algorithm for everything?
Because:
Some matrices are well-conditioned → easy to solve.
Some are ill-conditioned → require pivoting or regularization.
Some are symmetric positive definite → CG is perfect.
Some are nonsymmetric → GMRES works better.
Some are sparse → direct solvers are too expensive.
Some demand stability above all else → QR beats LU.
There is no one-size-fits-all solver — because the underlying numerical landscape differs.
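As a rough illustration of "different matrices, different solvers", here is a SciPy sketch (sizes and matrices are arbitrary): CG relies on symmetry and positive definiteness, while GMRES makes no such assumption.

```python
import numpy as np
from scipy.sparse.linalg import cg, gmres

rng = np.random.default_rng(0)
n = 200
b = rng.standard_normal(n)

# Symmetric positive definite system: Conjugate Gradient is the natural fit.
M = rng.standard_normal((n, n))
A_spd = M @ M.T + n * np.eye(n)
x_cg, info_cg = cg(A_spd, b)

# Nonsymmetric system: CG's assumptions break down, GMRES still applies.
A_nonsym = rng.standard_normal((n, n)) + n * np.eye(n)
x_gm, info_gm = gmres(A_nonsym, b)

# info == 0 means the iteration converged to the requested tolerance.
print(info_cg, np.linalg.norm(A_spd @ x_cg - b))
print(info_gm, np.linalg.norm(A_nonsym @ x_gm - b))
```

The same kind of specialization shows up on the optimizer side: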
| Neural Network Optimizer | Numerical Linear Algebra Analogy |
| --- | --- |
| SGD | Basic iterative descent (like Richardson iteration) |
| Momentum | Krylov acceleration / heavy-ball method |
| RMSProp / Adam | Diagonal preconditioning |
| AdamW | Adam + correct regularization (better-conditioned) |
| Adagrad | Natural scaling for sparse matrices |
| L-BFGS | Quasi-Newton methods |
The analogy is not loose — it is structural.
The deep insight is:
Neural network loss surfaces are extremely ill-conditioned.
Some directions in parameter space are very steep.
Some are very flat.
Some are almost saddle-shaped.
This is exactly like solving a system with a matrix $A$ whose singular values vary wildly.
In steep directions → large steps explode.
In flat directions → tiny steps make training stall.
This is the conditioning problem, identical to why iterative solvers struggle with bad matrices.
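Here is the tiniest possible version of that problem: gradient descent on a two-dimensional quadratic with one steep and one flat direction (the curvatures are chosen arbitrarily). No single learning rate works well for both directions.

```python
import numpy as np

# Quadratic loss f(x) = 0.5 * x^T H x with curvatures 100 (steep) and 0.01 (flat).
H = np.diag([100.0, 0.01])

def gradient_descent(lr, steps):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * (H @ x)          # gradient of the quadratic is H @ x
    return x

# Small step: safe in the steep direction, but the flat one barely moves.
print(gradient_descent(lr=0.01, steps=1000))   # roughly [0.0, 0.90]
# Larger step: the flat direction improves, but the steep one diverges
# (stability requires lr < 2 / 100 = 0.02).
print(gradient_descent(lr=0.05, steps=50))
```

This is the two-variable version of what an optimizer faces across millions of parameters.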
5. Optimizers Are Just Preconditioners in Disguise
In numerical linear algebra, we improve convergence using a preconditioner:
$$M^{-1}Ax = M^{-1}b$$
The purpose is to change the geometry of the problem so gradient steps behave nicely.
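A preconditioner can be as simple as dividing by the diagonal of $A$ (Jacobi preconditioning). A small SciPy sketch, with a made-up SPD matrix whose diagonal entries span several orders of magnitude:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, LinearOperator

n = 500
main = np.logspace(0, 4, n) + 2.0   # diagonal entries spanning 4 orders of magnitude
A = diags([main, -np.ones(n - 1), -np.ones(n - 1)], [0, -1, 1])  # SPD, ill-conditioned
b = np.ones(n)

def solve(M=None):
    """Run CG and count iterations via the callback."""
    count = 0
    def cb(xk):
        nonlocal count
        count += 1
    _, info = cg(A, b, M=M, callback=cb)
    return count, info

# Jacobi preconditioner: approximate A^{-1} by inverting only the diagonal.
jacobi = LinearOperator((n, n), matvec=lambda v: v / main, dtype=np.float64)

print("plain CG:         ", solve())          # many iterations
print("preconditioned CG:", solve(M=jacobi))  # far fewer iterations
```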
Now look at Adam:
$$\theta_{t+1} = \theta_t - \alpha \frac{m_t}{\sqrt{v_t} + \epsilon}$$
Here $v_t$ is an estimate of the per-coordinate second moment — basically a diagonal approximation of the Hessian.
That means:
Adam = Gradient descent with a diagonal preconditioner.
RMSProp, Adagrad, AdamW — all variations of the same idea.
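Written out, the correspondence is hard to miss. Here is a minimal NumPy sketch of the standard Adam update (including the usual bias correction), arranged so the diagonal preconditioner is explicit:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update, written to expose the diagonal preconditioner."""
    m = b1 * m + (1 - b1) * grad        # first moment: momentum on the gradient
    v = b2 * v + (1 - b2) * grad**2     # second moment: per-coordinate scale
    m_hat = m / (1 - b1**t)             # bias corrections for the zero-initialized moments
    v_hat = v / (1 - b2**t)
    # Diagonal preconditioner P = diag(sqrt(v_hat) + eps); the step is
    # theta <- theta - lr * P^{-1} m_hat, i.e. preconditioned gradient descent.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on the ill-conditioned quadratic from earlier: the per-coordinate
# scaling lets one learning rate serve both the steep and the flat direction.
H = np.diag([100.0, 0.01])
theta, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    theta, m, v = adam_step(theta, H @ theta, m, v, t, lr=0.01)
print(theta)   # both coordinates end up close to 0
```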
6. Why Not Just Use Adam Everywhere?
This is the same question as:
“Why not use LU for every matrix?”
Because while Adam stabilizes optimization, it sometimes hurts generalization.
SGD + Momentum often finds “flatter minima” — analogous to well-conditioned solutions in linear algebra that generalize better.
Depending on the loss landscape and data structure:
Adam is fast and stable (like QR factorization).
SGD with momentum generalizes better (like CG on SPD matrices).
Adagrad is great for sparse gradients (like solvers specialized for sparse matrices).
AdamW fixes Adam’s incorrect weight decay (analogy: pivoted LU vs. unpivoted LU).
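The weight-decay point is subtle enough to be worth spelling out. A simplified sketch of the two update rules, following the decoupled-weight-decay idea behind AdamW: in "Adam + L2" the decay term passes through the preconditioner, in AdamW it does not.

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    # "Adam + L2": weight decay is folded into the gradient, so it gets
    # rescaled per coordinate by the preconditioner like everything else.
    g = grad + wd * theta
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def adamw_step(theta, grad, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    # AdamW: decay is applied to the weights directly, outside the
    # preconditioned step, so every weight decays at the same rate.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v
```

In the L2 version, coordinates with a large gradient history (large $v$) are effectively decayed less; that inconsistency is what the decoupled update removes.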
7. The Final Takeaway
The diversity of ML optimizers is not arbitrary.
It is a natural consequence of numerical stability and conditioning — foundational ideas in numerical linear algebra.
You can summarize everything in one sentence:
Optimizers in deep learning are just stabilization and preconditioning strategies for an extremely large, ill-conditioned numerical optimization problem.
This perspective unifies ML training with the rich theory of numerical algorithms.
Once you see the analogy, optimizer choice stops feeling mysterious and starts feeling logical.