I can’t believe the author wrote that without explaining why it’s called the bell curve.
I find the article spends a lot of time talking about repeating games without really getting to the meat of it.
If you roll a die a million times, the results still follow a uniform distribution.
It isn’t until you start summing random events that the normal distribution occurs.
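A quick way to see the difference, as a sketch assuming fair six-sided dice (the function names, seed and trial counts are made up for illustration): the histogram of single rolls stays flat, while the histogram of ten-dice totals piles up around 35.

```python
import random
from collections import Counter


def sum_of_dice(num_dice, trials, seed=0):
    """Count the totals seen when rolling `num_dice` fair dice `trials` times."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    totals = Counter()
    for _ in range(trials):
        totals[sum(rng.randint(1, 6) for _ in range(num_dice))] += 1
    return totals


def print_histogram(counts, width=50):
    """Print a crude text histogram, scaled so the tallest bar is `width` wide."""
    peak = max(counts.values())
    for value in sorted(counts):
        print(f"{value:3d} {'#' * (counts[value] * width // peak)}")


if __name__ == "__main__":
    print_histogram(sum_of_dice(1, 60_000))   # flat: one die is uniform
    print_histogram(sum_of_dice(10, 60_000))  # bell-shaped: a sum of ten dice
```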
https://en.wikipedia.org/wiki/Infinite_divisibility_(probabi...
https://en.wikipedia.org/wiki/Stable_distribution
This applies even when the variance is not finite.
Note that independence and identical distribution are not necessary for the Central Limit Theorem to hold. They are sufficient conditions, not necessary ones; however, they do speed up the convergence a lot.
Gaussian distribution is a special case of the infinitely divisible distribution and is the most analytically tractable one in that family.
Averaging gives you a Gaussian as long as the original distribution is somewhat benign, but the MAX operator also has nice limiting properties. Suitably normalized maxima converge to one of three forms of limiting distribution, Gumbel being one of them.
The general form of the limiting distribution when you take the MAX of a sufficiently large sample is the generalized extreme value distribution:
https://en.wikipedia.org/wiki/Generalized_extreme_value_dist...
Very useful for studying record values -- severest floods, world records in the 100m sprint, record rainfall in a day, etc.
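A minimal sketch of the Gumbel case, assuming Exp(1) draws (my choice of example, not something claimed above): the maximum of n such draws, shifted by log(n), should have a CDF close to exp(-exp(-x)). The function names and sample sizes are made up.

```python
import math
import random


def centered_maxima(sample_size, trials, seed=0):
    """Maximum of `sample_size` Exp(1) draws, shifted by log(sample_size)."""
    rng = random.Random(seed)
    shift = math.log(sample_size)
    return [
        max(rng.expovariate(1.0) for _ in range(sample_size)) - shift
        for _ in range(trials)
    ]


def gumbel_cdf(x):
    """CDF of the standard Gumbel distribution."""
    return math.exp(-math.exp(-x))


def empirical_cdf(values, x):
    """Fraction of the simulated values that are <= x."""
    return sum(v <= x for v in values) / len(values)


if __name__ == "__main__":
    maxima = centered_maxima(200, 2000)
    for x in (-1.0, 0.0, 1.0, 2.0):
        print(x, round(empirical_cdf(maxima, x), 3), round(gumbel_cdf(x), 3))
```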
This convolution can be understood as a matrix multiplication by a specific symmetric matrix.
Anyone familiar with linear algebra will know that repeated matrix multiplication by non-degenerate matrices reveals its eigenvectors.
The Gaussian distribution is such an eigenvector. Just like an eigenvector, it is also a fixed point -- multiplying again by the same matrix will lead to the same vector, just scaled. The Gaussian distribution smoothed by a convolution is again a Gaussian distribution.
This is exactly what happens with Gaussians -- the addition is a matrix multiplication in the distribution space and the division by the 'total' in the averaging takes care of the scaling.
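Here is a rough numerical version of that picture, a sketch under my own assumptions: a lopsided three-point distribution is convolved with itself n times, then compared with the normal density that has the same mean and variance. The helper names and the starting distribution are invented for the example.

```python
import math


def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q (lists of probabilities)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def n_fold_convolution(pmf, n):
    """Distribution of the sum of n independent draws from `pmf`."""
    result = [1.0]  # point mass at 0, the identity for convolution
    for _ in range(n):
        result = convolve(result, pmf)
    return result


def gap_from_gaussian(pmf, n):
    """Largest gap to the matching normal density, relative to that density's peak."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    mu, sigma2 = n * mean, n * var
    peak = 1.0 / math.sqrt(2 * math.pi * sigma2)
    worst = 0.0
    for k, p in enumerate(n_fold_convolution(pmf, n)):
        density = peak * math.exp(-((k - mu) ** 2) / (2 * sigma2))
        worst = max(worst, abs(p - density))
    return worst / peak


if __name__ == "__main__":
    lopsided = [0.7, 0.1, 0.2]  # nothing bell-shaped about this
    for n in (1, 16, 256):
        print(n, round(gap_from_gaussian(lopsided, n), 4))
```

The gap shrinks as n grows, which is the "repeated multiplication pulls you toward the fixed point" behaviour, at least for this one starting distribution.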
Linear algebra is amazing.
TIL that I'm not "familiar" with linear algebra ;)
But seriously, thanks for sharing that knowledge.
So humble and basic a field. So wide its consequences and scope.
My expression of gratitude was sincere.
The great philosophical question is why CLT applies so universally. The article explains it well as a consequence of the averaging process.
Alternatively, I’ve read that natural processes tend to exhibit Gaussian behaviour because there is a tendency towards equilibrium: forces, homeostasis, central potentials and so on and this equilibrium drives the measurable into the central region.
For processes such as prices in financial markets, with complicated feedback loops and reflexivity (in the Soros sense), the probability mass tends to end up in the non-central region, where the CLT does not apply.
In finance, the effects of random factors tend to multiply. So you get a log-normal curve.
As Taleb points out, though, the underlying assumptions behind log-normal break in large market movements, because in large movements things that were uncorrelated become correlated. The result is fat tails, where extreme combinations of events (aka "black swans") become far more likely than naively expected.
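The multiplicative half of this is easy to sketch. Assuming independent growth factors drawn uniformly from [0.9, 1.12] (an arbitrary choice, as are the function names), the compounded value is strongly right-skewed while its logarithm is close to symmetric, which is what log-normal means. This does not model the correlation breakdown in large movements.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def compounded_growth(steps, trials, seed=0):
    """Final value after multiplying `steps` independent random growth factors."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        value = 1.0
        for _ in range(steps):
            value *= rng.uniform(0.9, 1.12)  # hypothetical per-step factor
        results.append(value)
    return results


if __name__ == "__main__":
    finals = compounded_growth(250, 4000)
    print("skewness of values:", round(skewness(finals), 2))
    print("skewness of logs:  ", round(skewness([math.log(v) for v in finals]), 2))
```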
https://en.wikipedia.org/wiki/Central_limit_theorem#Dependen...
I know you know that and were just simplifying. Just wanted this fact to be better known for practitioners. Your comment on multiplicative processes is spot on.
I say more here
https://news.ycombinator.com/item?id=47437152
It's a bit of a shame that these other limiting distributions are not as tractable as the Gaussian.
a) the CLT requires samples drawn from a distribution with finite mean and variance
and b) the Gaussian is the maximum entropy distribution for a particular mean and variance
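Point (b) can be spot-checked with closed-form differential entropies. This sketch compares only three families at the same standard deviation (Gaussian, uniform, Laplace), so it illustrates the claim rather than proving it; the function names are mine.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy (in nats) of a normal with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def uniform_entropy(sigma):
    """Entropy of a uniform with standard deviation sigma (width = sigma * sqrt(12))."""
    return math.log(sigma * math.sqrt(12))


def laplace_entropy(sigma):
    """Entropy of a Laplace with standard deviation sigma (scale b = sigma / sqrt(2))."""
    return 1 + math.log(2 * sigma / math.sqrt(2))


if __name__ == "__main__":
    for name, fn in [("gaussian", gaussian_entropy),
                     ("laplace", laplace_entropy),
                     ("uniform", uniform_entropy)]:
        print(name, round(fn(1.0), 4))
```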
I’d be curious about what happens if you start making assumptions about higher order moments in the distribution
The most interesting assumptions to relax are the independence assumptions. They're way more permissive than the textbook version suggests. You need dependence to decay fast enough, and mixing conditions (α-mixing, strong mixing) give you exactly that: correlations that die off let the CLT go through essentially unchanged. Where it genuinely breaks is long-range dependence: fractionally integrated processes with a Hurst parameter above 0.5, where autocorrelations decay hyperbolically instead of exponentially. There the √n normalization is wrong, you get different scaling exponents, and sometimes non-Gaussian limits.
There are also interesting higher order terms. The √n is specifically the rate that zeroes out the higher-order cumulants. Skewness (third cumulant) decays at 1/√n, excess kurtosis at 1/n, and so on up. Edgeworth expansions formalize this as an asymptotic series in powers of 1/√n with cumulant-dependent coefficients. So the Gaussian is the leading term of that expansion, and Edgeworth tells you the rate and structure of convergence to it.
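To make the Edgeworth point concrete, here is a sketch using sums of Exp(1) variables (my choice, because the exact CDF of the sum is an Erlang distribution and easy to compute). It compares the plain normal approximation with the one-term Edgeworth correction, which uses the skewness (2 for Exp(1)). All function names are invented.

```python
import math


def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2))


def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def erlang_cdf(t, n):
    """Exact CDF at t of the sum of n independent Exp(1) variables."""
    if t <= 0:
        return 0.0
    term = math.exp(-t)  # k = 0 term of the Poisson-style series
    total = term
    for k in range(1, n):
        term *= t / k
        total += term
    return 1.0 - total


def edgeworth_cdf(x, skew, n):
    """Normal CDF plus the first (skewness) Edgeworth correction term."""
    return normal_cdf(x) - normal_pdf(x) * skew / (6 * math.sqrt(n)) * (x * x - 1)


def approximation_errors(n, x, skew=2.0):
    """(normal error, Edgeworth error) for the standardized sum of n Exp(1) draws."""
    exact = erlang_cdf(n + x * math.sqrt(n), n)  # mean n, standard deviation sqrt(n)
    return abs(exact - normal_cdf(x)), abs(exact - edgeworth_cdf(x, skew, n))


if __name__ == "__main__":
    for x in (0.0, 0.5, 2.0):
        print(x, approximation_errors(20, x))
```

Note that the correction vanishes at x = ±1, so the two approximations coincide there.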
(I know it is very easy to do "maths" this way).
If I'm remembering it correctly it's interesting to think about the ramifications of that for the moments.
To me it results from 2 factors: 1. the Gaussian is the max entropy distribution for a given mean and variance, and 2. variance is the model of energy-limited behavior, and physical processes are always under some energy limits. Basically it is the 2nd law.
https://en.wikipedia.org/wiki/Galton_board
at the (I think) Boston Science Museum when I was a kid. They have some pretty cool videos on Youtube if you're curious.
Edit: see eg John Baez's write-up What is Entropy? about the entropy maximization principle, where gaussians make an entrance.
BUT for the exceptional world, causes multiply or cascade: earthquake magnitudes, network connectivity, etc. So, you get log-normal or fat-tailed.
This is one of the most fundamental things to understand in statistics. If you don't have at least some degree of comfort with this, you have no business working with data in a professional capacity.
The way I understand it, OP asked this as a way to open the conversation, while candidates interpreted it as a math problem to solve, unintentionally getting their mind into "exam" mode.
All summation roads lead to normal curves. (There might be an exception for weird probability distributions that do not have a mean; I was surprised when I learned these exist.)
Life is full of sums. Height? That's a sum of genetics and nutrition, and both of those can be broken down into other sums. How long the treads last on a tire? That's a sum of all the times the tire has been driven, and all of those times driving are just sums of every turn and acceleration.
I'm not a data scientist. I'm just a programmer that works with piles of poorly designed business logic.
How did I do in my interview? (I am looking for a job.)
If I had made the extra condition that the random variables had finite variance, you'd be correct. Without the finite variance condition, the distribution is Levy stable.
Levy stable distributions can have finite mean but infinite variance. They can also have infinite mean and infinite variance. Only in the finite mean and finite variance case does it imply a Gaussian.
Levy stable distributions are also called "fat-tailed", "heavy-tailed" or "power law" distributions. In some sense, Levy stable distributions are more normal than the normal distribution. It might be tempting to dismiss the infinite variance condition but, practically, this just means you get larger and larger numbers as you draw from the distribution.
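The Cauchy distribution is the textbook stable law with no mean or variance, so it makes a handy sketch here (my example; the function names are hypothetical). Averaging 100 Gaussian draws tightens the spread by about a factor of 10, while averaging 100 Cauchy draws leaves the spread where it started.

```python
import math
import random
import statistics


def standard_cauchy(rng):
    """Draw from a standard Cauchy via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def standard_normal(rng):
    return rng.gauss(0.0, 1.0)


def iqr_of_sample_means(draw, sample_size, trials, seed=0):
    """Interquartile range of the mean of `sample_size` draws, over many trials."""
    rng = random.Random(seed)
    means = [
        sum(draw(rng) for _ in range(sample_size)) / sample_size
        for _ in range(trials)
    ]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


if __name__ == "__main__":
    for name, draw in [("normal", standard_normal), ("cauchy", standard_cauchy)]:
        print(name,
              round(iqr_of_sample_means(draw, 1, 4000), 3),
              round(iqr_of_sample_means(draw, 100, 4000), 3))
```

The interquartile range is used instead of the standard deviation because the latter does not exist for the Cauchy case.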
This was one of Mandelbrot's main positions, that power laws were much more common than previously thought and should be adopted much more readily.
As an aside, if you do ever get asked this in an interview, don't expect to get the job if you answer correctly.
But the counterintuitive thing about the CLT is that it applies to distributions that are not normal.
For simplicity, take N identically distributed random variables that are uniform on the interval [-1/2,1/2], so the probability density function, f(x), is 1 on that interval.
The Fourier transform of f(x), F(w), is essentially sin(w)/w. Taking only the first few terms of the Taylor expansion, ignoring constants, gives (1-w^2).
Convolution is multiplication in Fourier space, so you get (1-w^2)^n. Squinting, (1-w^2)^n ~ (1-n w^2 / n)^n ~ exp(-n w^2). The Fourier transform of a Gaussian is a Gaussian, so the result holds.
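The squinting can be checked numerically. This sketch keeps the constants that the argument above drops: the transform of the uniform density on [-1/2,1/2] is sin(w/2)/(w/2), its variance is 1/12, and the sum is rescaled by sqrt(n) so the limit is exp(-t^2/24). Function names are made up.

```python
import math


def uniform_char_fn(w):
    """Fourier transform (characteristic function) of Uniform[-1/2, 1/2]."""
    return 1.0 if w == 0 else math.sin(w / 2) / (w / 2)


def scaled_sum_char_fn(t, n):
    """Characteristic function of (X_1 + ... + X_n) / sqrt(n): a product of n copies."""
    return uniform_char_fn(t / math.sqrt(n)) ** n


def gaussian_char_fn(t, variance=1 / 12):
    """Characteristic function of a zero-mean Gaussian with the given variance."""
    return math.exp(-variance * t * t / 2)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        gap = max(abs(scaled_sum_char_fn(t, n) - gaussian_char_fn(t))
                  for t in (0.5, 1.0, 2.0, 4.0))
        print(n, gap)
```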
Unfortunately I haven't worked it out myself but I've been told if you fiddle with the exponent of 2 (presumably choosing it to be in the range of (0,2]), this gives the motivation for Levy stable distributions, which is another way to see why fat-tailed/Levy stable distributions are so ubiquitous.
Widths of different uniform distributions along with different centers all still have a quadratic center, so the above argument only needs to be minimally changed.
The added bonus is that if the (1-w^2)^n is replaced by (1-w^a)^n, you can sort of see how to get at the Levy stable distribution (see the characteristic function definition [0]).
The point is that this gives a simple, high-level motivation as to why it's so common. Aside from seeing this flavor of proof in "An Invitation to Modern Number Theory" [1], I haven't really seen it elsewhere (though, to be fair, I'm not a mathematician). I also have never heard the connection of this method to the Levy stable distributions but for someone communicating it to me personally.
I disagree about the audience for Quanta. They tend to be exposed to higher level concepts even if they don't have a lot of in depth experience with them.
[0] https://en.wikipedia.org/wiki/Stable_distribution#Parametriz...
[1] https://www.amazon.com/Invitation-Modern-Number-Theory/dp/06...
The causal chain is: the math is simple -> teachers teach simple things -> students learn what they're taught -> we see the world in terms of concepts we've learned.
The central limit theorem generalizes beyond simple math to hard math: Levy alpha stable distributions when variance is not finite, the Fisher-Tippett-Gnedenko theorem and Gumbel/Fréchet/Weibull distributions regarding extreme values. Those curves are also everywhere, but we don't see them because we weren't taught them because the math is tough.
We can use Calculus to do so much but also so little…
It is certainly possible that there are complex approaches that statisticians have not discovered or don't teach because they are too complicated, but they had a big fight early in the discipline's history about which techniques were provably superior, and the choices of what got standardised on weren't made because of ease of calculation. It has actually been quite interesting how little interest statisticians seem to be taking in things like the machine learning revolution, since the mathematics all seems pretty amenable to last century's techniques despite orders of magnitude differences in the data being handled.
In practice when modeling you are almost always better not assuming normality, and you want to test models that allow the possibility of heavy tails. The CLT is an approximation, and modern robust methods or Bayesian methods that don't assume Gaussian priors are almost always better models. But this of course brings into question the very universality of the CLT (i.e. it is natural in math, but not really in nature).
Statisticians love averages, so anything that could be sampled as a normal distribution will be presented as one.
The median is actually more descriptive, and power laws are just as pervasive, if not more so.
* excluding bizarre degenerates like constants or impulse functions
He has several other related videos also.
https://www.youtube.com/@3blue1brown/search?query=convolutio...
No, the central limit theorem specifically doesn't address that. It says that the sum of iid random variables is well approximated by a normal distribution near the mean; it doesn't tell you how well that approximation works in the tails. The rarer the event you're modeling is, the less relevant the normal approximation is.
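A sketch of how this plays out for 100 fair coin flips (my example): the exact binomial upper tail against the normal approximation with a continuity correction. The function names are hypothetical.

```python
import math


def binomial_upper_tail(n, k):
    """Exact P(X >= k) for X ~ Binomial(n, 1/2)."""
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n


def normal_upper_tail(n, k):
    """Normal approximation to the same tail, with a continuity correction."""
    mean, sd = n / 2, math.sqrt(n) / 2
    return 0.5 * math.erfc((k - 0.5 - mean) / (sd * math.sqrt(2)))


def relative_error(n, k):
    """Relative error of the normal approximation for P(X >= k)."""
    exact = binomial_upper_tail(n, k)
    return abs(normal_upper_tail(n, k) - exact) / exact


if __name__ == "__main__":
    for k in (55, 60, 75, 90):
        print(k, binomial_upper_tail(100, k), normal_upper_tail(100, k),
              relative_error(100, k))
```

Near the mean the two agree closely; far out in the tail the normal approximation is off by a large multiple, even though both numbers are tiny.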
What are "most cases"?
Unfortunately, many "researchers" blindly assume that real-life phenomena follow a Gaussian distribution when many of them don't... So then their models are skewed
> suppose that a large sample of observations is obtained, each observation being randomly produced in a way that does not depend on the values of the other observations, and the average (arithmetic mean) of the observed values is computed. If this procedure is performed many times, resulting in a collection of observed averages, the central limit theorem says that if the sample size is large enough, the probability distribution of these averages will closely approximate a normal distribution.
> Laplace distilled this structure into a simple formula, the one that would later be known as the central limit theorem. No matter how irregular a random process is, even if it’s impossible to model, the average of many outcomes has the distribution that it describes. “It’s really powerful, because it means we don’t need to actually care what is the distribution of the things that got averaged,” Witten said. “All that matters is that the average itself is going to follow a normal distribution.”
This is not really true, because the central limit theorem requires a huge assumption: that the random process has finite variance. I believe that distributions that don't satisfy that assumption, which we can call heavy-tailed distributions, are much more common in the real world than this discussion suggests. Pointing out that infinities don't exist in the real world is also missing the point, since a distribution that just has a huge but finite variance will require a correspondingly huge number of samples to start behaving like a normal distribution.
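One rough way to put numbers on the last point, as a sketch under my own assumptions: take a log-normal (finite variance, but heavy right tail as sigma grows) and use skewness as a crude proxy for "not yet normal". The skewness of a mean of n iid draws is the single-draw skewness divided by sqrt(n), so we can ask how large n must be before it falls below 0.1. The 0.1 target and the function names are arbitrary.

```python
import math


def lognormal_skewness(sigma):
    """Skewness of a log-normal whose underlying normal has standard deviation sigma."""
    spread = math.exp(sigma ** 2)
    return (spread + 2) * math.sqrt(spread - 1)


def samples_for_target_skewness(sigma, target=0.1):
    """Smallest n for which the skewness of the mean of n iid draws is <= target."""
    return math.ceil((lognormal_skewness(sigma) / target) ** 2)


if __name__ == "__main__":
    for sigma in (0.25, 1.0, 2.0):
        print(sigma, samples_for_target_skewness(sigma))
```

Going from sigma = 0.25 to sigma = 2 moves the answer from tens of samples to more than ten million, with the variance finite the whole time.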
Apart from the universality, the normal distribution has a pretty big advantage over others in practice, which is that it leads to mathematical models that are tractable. To go into slightly more detail, in mathematical modeling, often you define some mathematical model that approximates a real-world phenomenon, but which has some unknown parameters, and you want to determine those parameters in order to complete the model. To do that, you take measurements of the real phenomenon, and you find values for the parameters that best fit the measurements. Crucially, the measurements don't need to be exact, but the distribution of the measurement errors is important. If you assume the errors are independent and normally distributed, then you get a relatively nice optimization problem compared to most other things. This is, in my opinion, about as responsible for the ubiquity of normal distributions in mathematical modeling as the universality from the central limit theorem.
However, as most people who solve such problems realize, sometimes we have to contend with these things called "outliers," which by another name are really samples from a heavy-tailed distribution. If you don't account for them somehow, then Bad Things(TM) are likely to happen. So either we try to detect and exclude them, or we replace the normal distribution with something that matches the real data a bit better.
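The simplest instance of both paragraphs is estimating a single location parameter. Assuming Gaussian errors, maximum likelihood is least squares and the answer is the mean; swapping in a heavier-tailed Laplace error model gives least absolute deviations and the median. The measurements below are invented for illustration, as are the function names.

```python
import statistics


def least_squares_location(values):
    """Minimizer of sum((v - m) ** 2): the mean (the MLE under Gaussian errors)."""
    return statistics.fmean(values)


def least_absolute_location(values):
    """Minimizer of sum(abs(v - m)): the median (the MLE under Laplace errors)."""
    return statistics.median(values)


if __name__ == "__main__":
    clean = [9.8, 10.1, 10.0, 9.9, 10.2]  # hypothetical readings of a true value near 10
    with_outlier = clean + [100.0]        # one bad reading
    for name, data in [("clean", clean), ("with outlier", with_outlier)]:
        print(name, least_squares_location(data), least_absolute_location(data))
```

One bad reading drags the least-squares estimate far from the bulk of the data, while the median barely moves.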
Anyway, to connect this all back to the central limit theorem, it's probably fair to say measurement errors tend to be the combined result of many tiny unrelated effects, but the existence of outliers is pretty strong evidence that some of those effects are heavy-tailed and thus we can't rely on the central limit theorem giving us a normal distribution.
The point on convergence rates re: the central limit theorem is also a major point otherwise clever people tend to miss, and which comes up in a lot of modeling contexts. Many things which make sense "in the limit" likely make no sense in real world practical contexts, because the divergence from the infinite limit in real-world sizes is often huge.
EDIT: Also from a modeling standpoint, say e.g. Bayesian, I often care about finding out something like the "range" of possible results for (1) a near-uniform prior, (2) a couple of skewed distributions, with the tail in either direction (e.g. some beta distributions), and (3) a symmetric heavy-tailed distribution (e.g. Cauchy). If you have these, anything assuming normality is usually going to be "within" the range of these assumptions, and so is generally not anything I would care about.
Basically, in practical contexts, you care about tails, so assuming they don't meaningfully exist is a non-starter. Looking at non-robust stats of any kind today, without also checking some robust models or stats, just strikes me as crazy.
Sums of independent identically distributed random variables, if they converge at all, converge to a Levy stable distribution (aka fat-tailed, heavy-tailed, power law). In this sense, Levy stable distributions are more "normal" than the normal distribution. They also show up with regular frequency all over nature.
As you point out, infinite variance might be dismissed but, in practice, this just ends up getting larger and larger "outliers" as one keeps drawing from the distribution. Infinities are, in effect, a "verb" and so an infinite variance, in this context, just means the distribution spits out larger and larger numbers the more you sample from it.
This is a tautology to the extreme.
If sums of independent identically distributed random variables converge to a distribution, they converge to a Levy stable distribution [0]. Tails of the Levy stable distribution are power law, which makes them not Gaussian.
Second, your "aka" is incorrect --- there is all sorts of clumping that is not a normal distribution.
> your "aka" is incorrect --- there is all sorts of clumping that is not a normal distribution.
That it's "incredibly common for people to label "bell curves" by eyeball, regardless of whether they are normal curves" is not just not relevant, it's anti-relevant ... the central limit theorem says that the distribution of the means is always a bell curve--a normal distribution--not merely a "bell curve".
Anyway, this is covered in far more detail in other comments and material elsewhere, so this is my last contribution.
a vast amount of fluff for less than a college statistics professor would (hopefully) be able to impart with a chalkboard in 10 minutes, when Quanta has the ability to prepare animated diagrams like 3Blue1Brown but chooses not to use it
they could go down myriad paths, like how it provides that random walks on square lattices are asymptotically isotropic, or give any other simple easy-to-understand applications (like getting an asymptotic on the expected # of rolls of an n-sided die before the first reoccurring face) or explain what a normal distribution is, but they only want to tell a story to convey a feeling
they are a blight upon this world for not using their opportunity to further public engagement in a meaningful way
Perhaps you're just not in their intended audience?
https://news.ycombinator.com/item?id=45800657
3b1b doesn't have the same goal as Quanta, or as introductory guides. It's actually not that great a teaching tool (it's truly great at what it is for, which is (a) appreciation and motivation, and (b) allowing people to signal how smart they are on message board threads by talking about how much people would get out of watching 3b1b).
This is prose writing about math. It's something you're meant to read for enjoyment. If you don't enjoy it, fine; I don't enjoy cowboy fiction. So I don't read it. I don't so much look for opportunities to yell at how much I hate "The Ballad of Easy Breezy".
My complaint is only that there should be a dozen more just like them, each competing with each other for the best, most engaging math and science content. This would allow a broader range of audience skill levels to be reached.
As it stands, we’re lucky even to have Quanta and 3b1b.
I think there is hope though, quite a few new-ish creators on YouTube are following in Grant’s footsteps and producing very technically detailed and informative content at similar quality levels.
by the metric of "if this expository piece were to be taken to a time before its subject had been considered and presented to researchers, how useful would its outline be towards reproducing the theory in its totality," Quanta's writings (on both classical and research math) mostly score 0
Seems a bit like Ted Talks. Lightweight popcorn for the simple minded.