Jun8 9 months ago

Not one mention of the EM algorithm, which, as far as I can understand, is being described here (https://en.m.wikipedia.org/wiki/Expectation%E2%80%93maximiza...). It has so many applications, among which is estimating the number of clusters for a Gaussian mixture model.

An ELI5 intro: https://abidlabs.github.io/EM-Algorithm/

  • Sniffnoy 9 months ago

    It does not appear to be what's being described here? Could you perhaps expand on the equivalence between the two if it is?

  • CrazyStat 9 months ago

    EM can be used to impute data, but that would be single imputation. Multiple imputation as described here would not use EM since the goal is to get samples from a distribution of possible values for the missing data.

    • wdkrnls 9 months ago

      In other words, EM makes more sense. All this imputation stuff seems to me more like an effort to keep using obsolete modeling techniques.

      • CrazyStat 9 months ago

        Absolutely not.

        EM imputation (or single imputation in general) fails to account for the uncertainty in imputed data. You end up with artificially inflated confidence in your results (p-values too small, confidence/credible intervals too narrow, etc.).

        Multiple imputation is much better.
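
        A minimal sketch of Rubin's pooling rules may make the difference concrete (the numbers below are made up; each entry is the point estimate and squared standard error from one imputed dataset):

        ```python
        # Hedged sketch: Rubin's rules for pooling estimates across m imputed
        # datasets. The example numbers are invented for illustration.
        import numpy as np

        def pool_rubin(estimates, variances):
            estimates = np.asarray(estimates, dtype=float)
            variances = np.asarray(variances, dtype=float)
            m = len(estimates)
            q_bar = estimates.mean()             # pooled point estimate
            w_bar = variances.mean()             # average within-imputation variance
            b = estimates.var(ddof=1)            # between-imputation variance
            total_var = w_bar + (1 + 1 / m) * b  # Rubin's total variance
            return q_bar, np.sqrt(total_var)

        est, se = pool_rubin([2.1, 1.8, 2.4, 2.0, 2.2],
                             [0.09, 0.11, 0.10, 0.08, 0.12])
        # Single imputation would report only sqrt(w_bar); the (1 + 1/m) * b
        # term is what keeps the intervals from being too narrow.
        ```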

  • miki123211 9 months ago

    > It has so many applications, among which is estimating number of clusters for a Gaussian mixture model

    Any sources for that? As far as I remember, EM is used to calculate actual cluster parameters (means, covariances etc), but I'm not aware of any usage to estimate what number of clusters works best.

    Source: I've implemented EM for GMMs for a college assignment once, but I'm a bit hazy on the details.
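
    A minimal sketch of EM for a GMM (plain NumPy/SciPy, with illustrative names) shows the point: the number of components K is an input, and EM only re-estimates the weights, means, and covariances for that fixed K.

    ```python
    # Minimal EM for a Gaussian mixture with a FIXED number of components K.
    # K is an input to the algorithm, not something EM estimates.
    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # crude initialisation: random data points as means, shared covariance
        means = X[rng.choice(n, K, replace=False)]
        covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
        weights = np.full(K, 1.0 / K)

        for _ in range(n_iter):
            # E-step: responsibilities r[i, k] = P(component k | x_i)
            r = np.column_stack([
                weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                for k in range(K)
            ])
            r /= r.sum(axis=1, keepdims=True)

            # M-step: re-estimate weights, means, covariances
            Nk = r.sum(axis=0)
            weights = Nk / n
            means = (r.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - means[k]
                covs[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)

        return weights, means, covs
    ```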

    • fleischhauf 9 months ago

      you are right, you still need to specify the number of clusters

      • BrokrnAlgorithm 9 months ago

        I've been out of the loop on stats for a while, but is there a viable approach for estimating ex ante the number of clusters when creating a GMM? I can think of constructing ex post metrics, i.e. using a grid and goodness-of-fit measurements, but these feel more like brute-forcing it.
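
        A hedged sketch of that "ex post grid + goodness of fit" route, using scikit-learn's GaussianMixture and BIC on a made-up two-blob dataset:

        ```python
        # Fit a GMM for each candidate K and compare BIC (lower is better).
        # The data here is synthetic, purely for illustration.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (150, 2))])

        bics = {}
        for k in range(1, 7):
            gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
            bics[k] = gmm.bic(X)

        best_k = min(bics, key=bics.get)  # BIC roughly approximates the marginal likelihood
        ```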

        • lukego 9 months ago

          Is the question fundamentally: what's the relative likelihood of each number of clusters?

          If so, then estimating the marginal likelihood of each one and comparing them seems pretty reasonable?

          (I mean in the sense of Jaynes chapter 20.)

        • disgruntledphd2 9 months ago

          Unsupervised learning is hard, and the pick K problem is probably the hardest part.

          For PCA or factor analysis there are lots of ways, but without some way of determining ground truth it's difficult to know if you've done a good job.

        • CrazyStat 9 months ago

          There are Bayesian nonparametric methods that do this by putting a Dirichlet process prior on the parameters of the mixture components. Both the prior specification and the computation (MCMC) are tricky, though.
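
          For a rough feel of this without writing MCMC, scikit-learn's variational BayesianGaussianMixture with a (truncated) Dirichlet process prior is a convenient stand-in: components the data don't support get weights near zero.

          ```python
          # Hedged sketch: a truncated Dirichlet process mixture via variational
          # inference (a stand-in for the MCMC treatment, not a replacement).
          import numpy as np
          from sklearn.mixture import BayesianGaussianMixture

          rng = np.random.default_rng(0)
          X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (150, 2))])

          dpgmm = BayesianGaussianMixture(
              n_components=10,  # truncation level, an upper bound rather than the answer
              weight_concentration_prior_type="dirichlet_process",
              random_state=0,
          ).fit(X)

          # Components with negligible weight are effectively pruned away.
          effective_k = int(np.sum(dpgmm.weights_ > 0.01))
          ```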

clircle 9 months ago

Does any living statistician come close to the level of Donald Rubin in terms of research impact? Missing data analysis, causal inference, the EM algorithm, and probably more. He just walks around creating new subfields.

  • j7ake 9 months ago

    Mike Jordan, Tibshirani, Emmanuel Candes

  • selectionbias 9 months ago

    Also approximate Bayesian computation, principal stratification, and the Bayesian Bootstrap.

  • aquafox 9 months ago

    Andrew Gelman?

    • nabla9 9 months ago

      Gelman has contributed to Bayesian statistics and hierarchical models, and Stan is great, but that's not even close to what Rubin has done.

      ps. Gelman was Rubin's doctoral student.

xiaodai 9 months ago

I don’t know. I find Quanta articles very high-noise. They’re always hyping something.

  • jll29 9 months ago

    I don't find the language of the article full of "hype"; they describe the history of different forms of imputation from single to multiple to ML-based.

    The table is particularly useful as it describes what the article is all about in a way that can stick in students' minds. I'm very grateful to Quanta Magazine for its popular science reporting.

    • billfruit 9 months ago

      The Quanta articles usually have a gossipy style and are very low information density.

      • SAI_Peregrinus 9 months ago

        They're usually more science history than science. Who did what, when, and a basic overview of why it's important.

  • vouaobrasil 9 months ago

    I agree with that. I skip the Quanta Magazine articles, mainly because the titles seem to be a little too hyped for my taste and don't represent the content as well as they should.

    • xiaodai 9 months ago

      Same. I think it does a disservice to math.

    • amelius 9 months ago

      Yes, typically a short conversation with an LLM gives me more info and understanding of a topic than reading a Quanta article.

  • MiddleMan5 9 months ago

    Curious, what sites would you recommend?

TaurenHunter 9 months ago

Donald Rubin is kind of a modern-day Leibniz...

Rubin Causal Model

Propensity Score Matching

Contributions to:

Bayesian Inference

Missing data mechanisms

Survey sampling

Causal inference in observational studies

Multiple comparisons and hypothesis testing

light_hue_1 9 months ago

I wish they actually engaged with this issue instead of writing a fluff piece. There are plenty of problems with multiple imputation.

Not the least of which is that it's far too easy to do the equivalent of p-hacking and get your data to be significant by playing games with how you do the imputation. Garbage in, garbage out.

I think all of these methods should be abolished from the curriculum entirely. When I review papers in ML/AI, I automatically reject any paper or dataset that uses imputation.

This is all a consequence of the terrible statistics used in most fields. Bayesian methods don't need to do this.

  • jll29 9 months ago

    There are plenty of legitimate articles that discuss/survey imputation in ML/AI: https://scholar.google.com/scholar?hl=de&as_sdt=0%2C5&q=%22m...

    • light_hue_1 9 months ago

      The prestigious journal "Artificial Intelligence in Medicine"? No. Just because it's on Google Scholar doesn't mean it's worth anything. These are almost all trash. On the first page there's one maybe-legit paper in an OK venue as far as ML is concerned (KDD, an adjacent field to ML), and that one is 30 years old.

      No. AI/ML folks don't do imputation on our datasets. I cannot think of a single major dataset in vision, NLP, or robotics that does so, despite missing data being a huge issue in those fields. It's an antiquated method for an antiquated idea of how statistics should work, and it is doing far more damage than good.

      • disgruntledphd2 9 months ago

        Ok, that's interesting. I profoundly disagree with your tone, but I would really like to hear what you regard as good approaches to the problem of missing data (particularly where you have dropout from a study or experiment).

        • nyrikki 9 months ago

          Perhaps looking into the issues with uncongeniality and multiple imputation may help, although I haven't looked at MI for a long time, so consider my reply an attempt to be helpful rather than authoritative.

          Another related intuition for a probable footgun relates to learning linearly inseparable functions like XOR, which require MLPs.

          A single missing value in an XOR situation is far more challenging than participant dropouts causing missing data.

          Specifically, the problem is counterintuitively non-convex, with multiple possibilities for convergence and no information in the corpus to know which may be true.

          That is a useful lens in my mind, where I think of the manifold being pushed down in opposite sectors, as with the kernel trick.

          Another potential lens to think about it is that in medical studies the assumption is that there is a smooth and continuous function, while in learning, we are trying to find a smooth continuous function with minimal loss.

          We can't assume that the function we need to learn is smooth, but autograd specifically limits what is learnable, and simplicity bias, especially with feed-forward networks, is an additional concern.

          One thing that is common for people to conflate is continuity with differentiability and smoothness.

          But the set of continuous functions that are differentiable _anywhere_ is a meager set.

          Like anything in math and logic, the assumptions you can make will influence what methods work.

          As ML is existential quantification, and because it is insanely good at finding efficient glitches in the matrix, MI would (within the limits of my admittedly limited knowledge) need to be a very targeted solution, applied with a lot of care to avoid set shattering from causing uncongeniality, especially in the unsupervised context.

          Hopefully someone else can provide more productive insights.

          • disgruntledphd2 9 months ago

            Honestly, I think that we're coming at this from very different perspectives.

            Single imputation is garbage for accurate inference, as it understates variance and thus narrows confidence intervals as P(missing) increases.

            MI is a useful method for alleviating this bias (though at the cost of a lot more compute).

            That's why it gets used, and it's performed extremely well in real world analyses for basically my entire life (and I'm middle-aged now).

            > especially in the unsupervised context.

            I wouldn't use MI in an unsupervised context (but maybe some people do).

  • parpfish 9 months ago

    I feel like multiple imputation is fine when you have data missing at random.

    The problem is that data is never actually missing at random and there’s always some sort of interesting variable that confounds which pieces are missing

    • underbiding 9 months ago

      True, true, but how do you account for missing data based on variables you care about and those you don't?

      More specifically, how do you determine if the pattern you seem to be identifying is actually related to the phenomenon being measured and not an error in the measurement tools themselves?

      For example, a significant number of answers to "Yes / No: have you ever been assaulted?" are blank. This could be (A) respondents who were assaulted are more likely to leave it blank out of shame, or (B) someone handling the spreadsheet accidentally dropped some rows in the data (because let's be serious here, it's all spreadsheets and emails...).

      While you could say that (B) should be theoretically "more truly random", we can't assume that there isn't a pattern to the way those rows were dropped (i.e. a pattern imposed by some algorithm that bugged out and dropped those rows).

      • Xcelerate 9 months ago

        > how do you determine if the pattern you seem to be identifying is actually related to the phenomenon being measured and not an error in the measurement tools themselves?

        If the “which data is missing” information can be used to compress the data that isn’t missing further than it can be compressed alone, then the missing data is missing at least in part due to the phenomenon being measured. Otherwise, it’s not.

        We’re basically just asking if K(non-missing data | which data is missing) < K(non-missing data). This is uncomputable, so it doesn’t actually answer your question regarding “how to determine”, but it does provide a necessary and sufficient theoretical criterion.

        A decent practical approximation might be to see if you can develop a model that predicts the non-missing data better when augmented with the “which information is missing” information than via self-prediction. That could be an interesting research project...
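
        One hypothetical way to approximate that (with made-up column names and synthetic data): check whether a missingness indicator for the incomplete column improves cross-validated prediction of a fully observed column.

        ```python
        # Sketch of the suggestion above: does the missingness pattern of one
        # column help predict another, fully observed column? If so, the data
        # are probably not missing completely at random. Everything here is
        # synthetic and illustrative.
        import numpy as np
        import pandas as pd
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n = 500
        income = rng.normal(50, 10, n)
        df = pd.DataFrame({
            "age": rng.integers(18, 80, n),
            "education_years": rng.integers(8, 20, n),
            "income": income,
            # missingness probability depends on income -> not MCAR by construction
            "sensitive_q": np.where(rng.random(n) < 1 / (1 + np.exp(-(income - 50) / 5)),
                                    np.nan, 1.0),
        })

        base = df[["age", "education_years"]]
        mask = df["sensitive_q"].isna().astype(int)

        base_score = cross_val_score(RandomForestRegressor(random_state=0),
                                     base, df["income"], cv=5).mean()
        aug_score = cross_val_score(RandomForestRegressor(random_state=0),
                                    base.assign(sensitive_missing=mask),
                                    df["income"], cv=5).mean()
        # If aug_score is consistently higher, the missingness carries
        # information about the phenomenon being measured.
        ```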

        • parpfish 9 months ago

          There’s already a bunch of stats research on this problem. Some useful terms to look up are MCAR (missing completely at random) and MNAR (missing not at random)

  • DAGdug 9 months ago

    Maybe in academia, where sketchy incentives rule. In industry, p-hacking is great till you’re eventually caught for doing nonsense that isn’t driving real impact (still, the lead time is enough to mint money).

    • light_hue_1 9 months ago

      Very doubtful. There are plenty of drugs that get approved and are of questionable value. Plenty of procedures that turn out to be not useful. The incentives in industry are even worse because everything depends on lying with data if you can do it.

      • hggigg 9 months ago

        Indeed. Even worse, some entire academic fields are built on pillars of lies. I was married to a researcher in one of them. Anything that compromises the existence of the field just gets written off. The end result was that this fed into life-changing healthcare decisions, so one should never assume academia is harmless. It was utterly painful to watch from the perspective of a mathematician.

      • nerdponx 9 months ago

        I assume by "in industry" they meant in jobs where you are doing data analysis to support decisions that your employer is making. This would be any typical "data scientist" job nowadays. There the consequences of BSing are felt by the entity that pays you, and will eventually come back around to you.

        The incentives in medicine are more similar to those in academia, where your job is to cook up data that convinces someone else of your results, with highly imbalanced incentives that reward fraud.

        • DAGdug 9 months ago

          Yes, precisely this! I’ve seen more than a few people fired for generating BS analyses that didn’t help their employer, especially in tech where scrutiny is immense when things start to fail.

  • aabaker99 9 months ago

    My intuition would be that there are certain conditions under which Bayesian inference for the missing data and multiple imputation lead to the same results.

    What is the distinction?

    The scenario described in the paper could be represented in a Bayesian method or not. “For a given missing value in one copy, randomly assign a guess from your distribution.” Here “my distribution” could be Bayesian or not, but either way it’s still up to the statistician to make good choices about the model. The Bayesian can p-hack here all the same.

  • fn-mote 9 months ago

    Clearly you know your stuff. Are there any not-super-technical references where an argument against using imputation is clearly explained?

karaterobot 9 months ago

Does anyone else find it maddeningly difficult to read Quanta articles on desktop, because the nav bar keeps dancing around the screen? One of my least favorite web design things is the "let's move the bar up and down the screen depending on what direction he's scrolling, that'll really mess with him." I promise I can find the nav bar on my own when I need it.

paulpauper 9 months ago

why not use regression on the existing entries to infer what the missing ones should be?

  • ivan_ah 9 months ago

    That would push things towards the mean... not necessarily a bad thing, but presumably later steps of the analysis will be pooling/averaging data together, so it's not that useful.

    A more interesting approach, let's call it OPTION2, would be to sample from the predictive distribution of a regression (regression mean + noise), which would result in more variability in the imputations, although it's random, so it might not be what you want.

    The multiple imputation approach seems to be a resampling method of obtaining OPTION2, without needing to assume a linear regression model.

    • stdbrouw 9 months ago

      Multiple imputation simply means you impute multiple times and run the analysis on each complete (imputed) dataset so you can incorporate the uncertainty that comes from guessing at missing values into your final confidence intervals and such. How you actually do the imputation will depend on the type of variable, the amount of missingness etc. A draw from the predictive distribution of a linear model of other variables without missing data is definitely a common method, but in a state-of-the-art multiple imputation package like mi in R you can choose from dozens.
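
      For what it's worth, a minimal Python sketch of generating several imputed copies (using scikit-learn's IterativeImputer with sample_posterior=True as one possible imputation model; packages like mi or mice in R offer many more):

      ```python
      # Each run draws imputations from a predictive distribution, so the
      # copies differ; the analysis is then run on each copy and pooled.
      import numpy as np
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer

      X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.1],
                    [np.nan, 8.0], [5.0, 9.9]])

      imputed_datasets = []
      for m in range(5):  # five imputed copies
          imp = IterativeImputer(sample_posterior=True, random_state=m)
          imputed_datasets.append(imp.fit_transform(X))

      # Run the intended analysis on each copy, then pool the results
      # (e.g. with Rubin's rules) to reflect imputation uncertainty.
      ```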

bgnn 9 months ago

Why not interpolate the missing data points with similar patients' data?

This must be about the confidence of the approach. Maybe interpolation would be overconfident too.

SillyUsername 9 months ago

Isn't this just Monte Carlo, or did I miss something?

  • hatmatrix 9 months ago

    Monte Carlo is one way to implement multiple imputation.

userbinator 9 months ago

It reminds me somewhat of dithering in signal processing.

a-dub 9 months ago

life is nothing but shaped noise