## Interpretations of Probability

• 1.9k
This is a response to @Jeremiah from the 'Chance, Is It Real?' thread.

Pre-amble: I'm going to assume that someone reading it knows roughly what an 'asymptotic argument' is in statistics. I will also gloss over the technical specifics of estimating things in Bayesian statistics, instead trying to suggest their general properties in an intuitive manner. However, it is impossible to discuss the distinction between Bayesian and frequentist inference without some technical detail, so it is unlikely that someone without a basic knowledge of statistics will understand this post fully.

Reveal rough definition of asymptotic argument
If not, an asymptotic argument is a mathematical derivation that assumes we have an infinite (or 'sufficiently large' = infinity) sample size or an infinite number of repeated experiments of exactly the same set up.

In contemporary statistics, there are two dominant interpretations of probability.

1) That probability is always proportional to the long-term frequency of a specified event.
2) That probability is the quantification of uncertainty about the value of a parameter in a statistical model.

(1) is usually called the 'frequentist interpretation of probability' and (2) the 'Bayesian interpretation of probability', though there are others. Each of these philosophical positions has numerous consequences for how data is analysed. I will begin with a brief history of the two viewpoints.

The frequentist idea of probability can trace its origin to Ronald Fisher, who gained his reputation in part through analysis of genetics in terms of probability - being a founding father of modern population genetics, and in part through the design and analysis of comparative experiments - developing the analysis of variance (ANOVA) method for their analysis. I will focus on the developments resulting from the latter, eliding technical detail. Bayesian statistics is named after Thomas Bayes, the discoverer of Bayes Theorem, which arose in analysing games of chance. More technical details are provided later in the post. Suffice to say now that Bayes Theorem is the driving force behind Bayesian statistics, and it has a quantity in it called the prior distribution - whose interpretation is incompatible with frequentist statistics.

The ANOVA is an incredibly commonplace method of analysis in applications today. It allows experimenters to ask questions relating to the variation of quantitative observations over a set of categorical experimental conditions.

For example, in agricultural field experiments 'Which of these fertilisers is the best?'

The application of fertilisers is termed a 'treatment factor'. Say there are 2 fertilisers called 'Melba' and 'Croppa'; then the 'treatment factor' has two levels (values it can take), 'Melba' and 'Croppa'. Assume we have one field treated with Melba, and one with Croppa. Each field is divided into (say) 10 units, and after the crops are fully grown, the total mass of vegetation in each unit will be recorded. An ANOVA allows us to (try to) answer the question 'Is Croppa better than Melba?'. This is done by assessing the mean of the vegetation mass for each field and comparing these with the observed variation in the masses. Roughly: if the difference in mean masses for Croppa and Melba (Croppa-Melba) is large compared to how variable the masses are, we can say there is evidence that Croppa is better than Melba. How?

This is done by means of a hypothesis test. At this point we depart from Fisher's original formulation and move to the more modern developments by Neyman and Pearson (which are now the industry standard). A hypothesis test is a procedure to take a statistic like 'the difference between Croppa and Melba' and assign a probability to it. This probability is obtained by assuming a base experimental condition, called 'the null hypothesis', several 'modelling assumptions' and an asymptotic argument.

In the case of this ANOVA, these are roughly:

A) Modelling assumptions: variation between treatments only manifests as variations in means, any measurement imprecision is distributed Normally (a bell curve).
B) Null hypothesis: There is no difference in mean yields between Croppa and Melba
C) Asymptotic argument: assume that B is true, then what is the probability of observing the difference in yields in the experiment assuming we have an infinitely large sample or infinitely many repeated samples? We can find this through the use of the Normal distribution (or more specifically for ANOVAS, a derived F distribution, but this specificity doesn't matter).

The combination of B and C is called a hypothesis test.

The frequentist interpretation of probability is used in C. This is because a probability is assigned to the observed difference by calculating on the basis of 'what if we had an infinite sample size or infinitely many repeated experiments of the same sort?' and the derived distribution for the problem (what defines the randomness in the model).
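To make the 'infinitely many repeated experiments' idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the standard ANOVA routine: the yields are invented, and instead of looking the statistic up in the F distribution it simulates many repeated experiments in which the null hypothesis (B) and the Normal modelling assumption (A) both hold, then reports how often the simulated F statistic is at least as large as the observed one.

```python
import random
import statistics

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group mean square / within-group mean square."""
    values = [y for g in groups for y in g]
    grand_mean = statistics.fmean(values)
    means = [statistics.fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical vegetation masses (kg per unit), 10 units per field.
# These numbers are invented for illustration only.
melba = [20.1, 19.4, 21.0, 20.6, 19.8, 20.3, 21.2, 19.9, 20.7, 20.0]
croppa = [21.3, 20.9, 22.1, 21.6, 20.8, 21.9, 22.4, 21.0, 21.7, 21.2]

observed_f = f_statistic([melba, croppa])

def simulated_p_value(observed, n_per_group=10, n_repeats=5000, seed=0):
    """Long-run frequency of F >= observed when there is truly no difference.

    Under the null the F statistic does not depend on the common mean or
    variance, so standard Normal draws are enough.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_repeats):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if f_statistic([a, b]) >= observed:
            exceed += 1
    return exceed / n_repeats

p_value = simulated_p_value(observed_f)
print(f"F = {observed_f:.2f}, simulated p-value = {p_value:.4f}")
```

The p-value here is literally a frequency over imagined repetitions of the experiment, which is the sense in which the frequentist interpretation enters at step C.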

An alternative method, a Bayesian analysis, would allow the same modelling assumptions (A), but would base its conclusions on the following procedure:

A) the same as before
B) Define what is called a prior distribution on the error variance.
C) Fit the model using Bayes Theorem.
D) Calculate the odds ratio of the statement 'Croppa is better than Melba' to 'Croppa is worse than or equal to Melba' using the derived model.

I will elide the specifics of fitting a model using Bayes Theorem. Instead I will provide a rough sketch of a general procedure for doing so below. It is more technical, but still only a sketch to provide an approximate idea.

Bayes theorem says that for two events A and B and a probability evaluation P:
P(A|B) = P(B|A)P(A) / P(B)
where P(A|B) is the probability that A happens given that B has already happened, the conditional probability of A given B. If we also allow P(B|A) to depend on the data X, we can obtain P(A|B,X), which is called the posterior distribution of A.

For our model, we would have P(B|A) be the likelihood as obtained in frequentist statistics (modelling assumptions), in this case a normal likelihood given the parameter A = the noise variance of the difference between the two quantities. And P(A) is a distribution the analyst specifies without reference to the specific values obtained in the data, supposed to quantify the a priori uncertainty about the noise variance of the difference between Croppa and Melba. P(B) is simply a normalising constant to ensure that P(A|B) is indeed a probability distribution.
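For readers who find the event notation awkward, the same statement can be written in the parameter-and-data notation common in textbooks. The symbols here are an assumption of this sketch rather than part of the post: $\theta$ stands for the unknown parameter (the role of A above) and $X$ for the observed data.

$$
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta')\, p(\theta')\, d\theta'}
$$

Here $p(X \mid \theta)$ is the likelihood, $p(\theta)$ is the prior distribution, and the integral in the denominator is the normalising constant that plays the role of P(B).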

Bayesian inference instead replaces the assumptions B and C with something called the prior distribution and likelihood, Bayes Theorem and a likelihood ratio test. The prior distribution for the ANOVA is a guesstimate of how variable the measurements are without looking at the data (again, approximate idea, there is a huge literature on this). This guess is a probability distribution over all the values that are sensible for the measurement variability. This whole distribution is called the prior distribution for the measurement variability. It is then combined with the modelling assumptions to produce a distribution called the 'posterior distribution', which plays the same role in inference as the modelling assumptions and the null hypothesis in the frequentist analysis. This is because the posterior distribution then allows you to produce estimates of how likely the hypothesis 'Croppa is better than Melba' is compared to 'Croppa is worse than or equal to Melba', which is called an odds ratio.
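A rough sketch of steps B to D in Python, using only a grid approximation. Everything specific is an assumption made for illustration: the yields are invented, the prior is placed on the measurement standard deviation (a half-Normal with scale 2) rather than on the variance itself, and flat priors are used for the two field means. With those choices the means can be integrated out by hand, leaving a one-dimensional grid over the standard deviation.

```python
import math
import statistics

# Hypothetical vegetation masses (kg per unit); invented for illustration only.
melba = [20.1, 19.4, 21.0, 20.6, 19.8, 20.3, 21.2, 19.9, 20.7, 20.0]
croppa = [21.3, 20.9, 22.1, 21.6, 20.8, 21.9, 22.4, 21.0, 21.7, 21.2]

n = len(melba)                      # units per field (equal sizes assumed)
diff = statistics.fmean(croppa) - statistics.fmean(melba)
ss_within = sum((y - statistics.fmean(g)) ** 2 for g in (melba, croppa) for y in g)
df = 2 * n - 2

def log_prior_sigma(sigma, scale=2.0):
    """Assumed half-Normal prior on the measurement standard deviation (unnormalised)."""
    return -0.5 * (sigma / scale) ** 2

def log_posterior_sigma(sigma):
    """Unnormalised log posterior of sigma after integrating out both means."""
    return log_prior_sigma(sigma) - df * math.log(sigma) - ss_within / (2 * sigma ** 2)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Grid over plausible standard deviations, 0.01 to 10.
sigmas = [0.01 * i for i in range(1, 1001)]
log_w = [log_posterior_sigma(s) for s in sigmas]
max_log = max(log_w)                # subtract the maximum for numerical stability
weights = [math.exp(lw - max_log) for lw in log_w]
total = sum(weights)

# Given sigma, the difference in means is Normal(diff, sigma^2 * 2 / n) a posteriori,
# so P(difference > 0 | sigma, data) is a Normal CDF; average it over the sigma grid.
p_croppa_better = sum(
    w * normal_cdf(diff / (s * math.sqrt(2.0 / n))) for w, s in zip(weights, sigmas)
) / total
posterior_odds = p_croppa_better / (1.0 - p_croppa_better)

print(f"P(Croppa better | data) = {p_croppa_better:.5f}")
print(f"odds of 'better' against 'worse or equal' = {posterior_odds:.1f}")
```

No appeal to repeated experiments is made anywhere: the output is a statement about the parameters given this one data set and the chosen priors, and it will shift if the priors are changed.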

The take home message is that in a frequentist hypothesis test - we are trying to infer upon the unknown fixed value of a population parameter (the difference between Croppa and Melba means), in Bayesian inference we are trying to infer on the posterior distribution of the parameters of interest (the difference between Croppa and Melba mean weights and the measurement variability). Furthermore, the assignment of an odds ratio in Bayesian statistics does not have to depend on an asymptotic argument relating the null hypothesis and alternative hypothesis to the modelling assumptions. Also, it is impossible to specify a prior distribution through frequentist means (it does not represent the long run frequency of any event, nor an observation of it).

Without arguing which is better, this should hopefully clear up (to some degree) my disagreement with @Jeremiah and perhaps provide something interesting to think about for the mathematically inclined.
• 1.9k
Summary:

Frequentist - fixed population parameters, hypothesis tests, asymptotic arguments.
Bayesian - random population parameters, likelihood ratio tests, possible non-reliance on asymptotic arguments.

Major disagreement: interpretation of the prior probability.
• 1.9k
One last thing I forgot: you can see that there was no mention of the measure theoretic definition of probability in the above posts. This is because the disagreement between frequentist and Bayesian interpretations of probability is fundamentally one of parameter estimation. You need the whole machinery of measure theoretic probability to get rigorously to this part of statistical analysis, so the bifurcation between the interpretations occurs after it.
• 7.6k
Does it really matter? You can always just accept both, and specify whether you're stating the frequentist or the Bayesian probability.
• 1.9k
If you're a strict Bayesian the vast majority of applied research is bogus (since it uses hypothesis tests). If you're a strict frequentist you don't have the conceptual machinery to deal with lots of problematic data types, and can't use things like Google's tailored search algorithm or Amazon's recommendations.

In practice people generally think the use of both is acceptable.
• 712
There's also a very good introduction to interpretations of probability in the SEP article Interpretations of Probability. (Though probably biased towards Bayesianism, the author's preference.)

I think that practical differences between frequentism and Bayesianism are overstated. With careful analysis one method can usually be translated into the other.
• 1.9k

It's probably true that for every Bayesian method of analysis there's a similar non-Bayesian one which deals with the same problem, or estimates the same model. I think the differences between interpretations of probability arise because pre-theoretic intuitions of probability are an amalgamation of several conflicting aspects. For example, judging a dice to be fair because its centre of mass is in the middle and that the sides are the same area vs judging a dice to be fair because of many rolls vs judging a dice to be fair because loaded dice are uncommon; each of these is an ascription of probability in distinct and non-compatible ways. In order, objective properties of the system: 'propensity' in the language of the SEP article; because the observed frequency of sides is consistent with fairness, 'long term frequency' in the SEP article; and an intuitive judgement about the uncommonness of loaded dice, 'subjective probability' in the SEP article.

An elision from the article is that the principle of indifference (part claim that 'equally possibles are equally probables' and part claim that randomness is derived from a priori equipotentiality) actually places constraints on which probability measures give rise to random variables consistent with a profound lack of knowledge of their typical values. In what sense? If the principle of indifference is used to represent lack of knowledge in arbitrary probability spaces, we could only use a probability measure proportional to the Lebesgue measure in those spaces (or counting measure for discrete spaces). I.e., we can only use the uniform distribution. This paper gives a detailed treatment of the problems with this entailment.

I think it's worth noting that pre-theoretic ideas of probability don't neatly correspond to a single concept, whereas all authors in the SEP article having positions on these subjects agree on what random variables and probability measures are (up to disagreements in axiomatisation). It is a bizarre situation in which everyone agrees on the mathematics of random variables and probability to a large degree, but there is much disagreement on what the ascription of probability to these objects means [despite being a part of the mathematical treatment]. I believe this is a result of the pre-theoretic notion of probability that we have being internally conflicted - bringing various non-equivalent regimes of ideas together, as I detailed above with the examples.

Another point in criticism of the article is that one of the first contributors to Bayesian treatments of probability, Jeffreys, advocates the use of non-probability distributions as prior distributions in Bayesian analysis [for variance parameters] - ones which cannot represent the degree of belief of a subject and do not obey the constraints of a probability calculus. The philosophical impact of the practice of statistics no longer depending, in some sense, on dealing with exact distributions and likelihoods is, to my knowledge, small to nonexistent. I believe an encounter between contemporary statistics research - especially in the fields of prior specification and penalised regression - and interpretations of probability would definitely be valuable and perturbing.

Lastly, it makes no mention of the research done in psychology about how subjects actually make probability and quantitative judgements - it is very clear that preferences stated about the outcomes of quantitative phenomena are not attained through the use of some utility calculus and not even through Bayes theorem. If there is a desire to find out how humans do think about probability and form beliefs in the presence of uncertainty rather than how we should in an idealised betting room, this would be a valuable encounter too.
• 712
Yes, priors, their choice and justification are a vexed issue for Bayesianism, so much so that some would rather not deal with them at all (e.g. "likelihoodism" of Fitelson and Sober), or at least eschew ignorance priors (e.g. Norton). But to Bayesianism's credit, it at least makes the issue explicit, whereas frequentism kind of sweeps it under the rug.

Good point about psychology as well. Orthodox Bayesianism is usually justified by Dutch book arguments or similar, which presuppose some highly idealized rationally calculating agent. It is often said that people's intuition is crap at dealing with probabilities. This sentiment, no doubt, sets that kind of rational probability as the standard for comparison. But wasn't the very idea of "subjective" probability to take our psychological intuitions as the primary source of probability valuations? There seem to be conflicting agendas here. But on the other hand, if we give up the simplistic rationalism of Bayes, won't we then diverge from scientific (not to mention mathematical) probability, carving out a special theory that's only relevant to psychology?
• 1.9k
@'SophistiCat'
I've not met likelihoodism before, do you have any good references for me to read? Will respond more fully once I've got some more familiarity with this.
• 712
I don't have anything on hand, and I cited Sober and Fitelson (his onetime student I think) from memory. But if you google likelihoodism you'll readily find some texts.
• 1.9k
I read a few things on likelihoodism and other ideas of what is the 'right way' to show that data favours a hypothesis against a (set of) competing hypotheses. They could be summarised by the following:

Data $X$ supports hypothesis $H_1$ over $H_2$ iff some contrasting function $d$ of the posterior or likelihood given the data and the two hypotheses is greater than a specified value [0 for differences, 1 for ratios]. Contrasting functions could include ratios of posterior odds, ratios of posterior to prior odds, or the raw likelihood ratios, for example.

This really doesn't generalise very well to statistical practice. I'll list some problems, though I'm sure there are more, starting with the non-Bayesian ones:

1) Using something like a random forest regression doesn't allow you to test hypotheses in these ways, and will not output a likelihood ratio or give you the ability to derive one.
2) Models are usually relationally compared with fit criteria such as the AIC, BIC or DIC. These involve the likelihood but are also functions of the number of model parameters.
3) Likelihoodism is silent on using post-diagnostics for model comparison.
4) You couldn't look at the significance of an overdispersion parameter in Poisson models since this changes the likelihood to a formally distinct function called a quasi-likelihood.
5) It is silent on the literature using penalized regression or loss functions more generally for model comparisons.
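To illustrate point 2, here is a small Python sketch of how AIC and BIC combine the maximised log-likelihood with a penalty for the number of parameters. The formulas are the standard ones (AIC = 2k - 2 log L, BIC = k log n - 2 log L); the data and the two candidate models are invented for illustration.

```python
import math
import statistics

def normal_max_log_likelihood(residuals):
    """Maximised Normal log-likelihood, with the variance set to its maximum likelihood estimate."""
    n = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2_hat) + 1.0)

def aic(log_lik, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return 2.0 * k - 2.0 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2.0 * log_lik

# Hypothetical yields; invented for illustration only.
group_a = [20.1, 19.4, 21.0, 20.6, 19.8, 20.3, 21.2, 19.9, 20.7, 20.0]
group_b = [21.3, 20.9, 22.1, 21.6, 20.8, 21.9, 22.4, 21.0, 21.7, 21.2]
everything = group_a + group_b
n_obs = len(everything)

# Model 0: one common mean (parameters: mean, variance).
res0 = [y - statistics.fmean(everything) for y in everything]
# Model 1: a separate mean per group (parameters: two means, variance).
res1 = [y - statistics.fmean(g) for g in (group_a, group_b) for y in g]

ll0, ll1 = normal_max_log_likelihood(res0), normal_max_log_likelihood(res1)
aic0, aic1 = aic(ll0, 2), aic(ll1, 3)
bic0, bic1 = bic(ll0, 2, n_obs), bic(ll1, 3, n_obs)

print(f"AIC: common mean {aic0:.2f}, separate means {aic1:.2f}")
print(f"BIC: common mean {bic0:.2f}, separate means {bic1:.2f}")
```

Lower values are preferred. The comparison depends on the parameter counts as well as the likelihoods, so it is not a function of a likelihood ratio alone.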

Bayesian problems:

1) Bayes factors (a popular $d$ from above) are riddled with practical problems; most people wanting to test the 'significance' of a hypothesis instead see if a test parameter value belongs to a 95% posterior quantile credible interval [the thing people want to do with confidence intervals but can't].
2) It completely elides the use of posterior predictive checks for calibrating prior distributions.
3) It completely elides the use of regularisation and shrinkage: something like the prior odds ratio of two half Cauchies or t-distribution on 2 degrees of freedom wouldn't mean much [representing no hypothesis other than the implicit 'the variance is unlikely to be large', which is the hypothesis in BOTH numerator and denominator of the prior ratio].

The most damning thing is really its complete incompatibility with posterior diagnostics or model fitting checks in choosing models. They're functions of the model independent of its specifications and are used for relational comparison of evidence.
• 1.9k
@Sophisticat

There's a big probability I'm being unfair based on unfamiliarity with the literature, just whenever I've read philosophy of statistics it is usually concerned with things very separate from the current practice of statistics - especially statistical modelling. Still, if you notice any ways I'm being unfair I'd like to hear them.

I forgot to reply to this:

But wasn't the very idea of "subjective" probability to take our psychological intuitions as the primary source of probability valuations? There seem to be conflicting agendas here. But on the other hand, if we give up the simplistic rationalism of Bayes, won't we then diverge from scientific (not to mention mathematical) probability, carving out a special theory that's only relevant to psychology? — Sophisticat

There's been an attempt to assess the consequences of giving up the rational utility maximisers/probabilistic rationality since the 70s, following behavioural economics and the experimental psychology behind it. A landmark paper in this regard is 'Prospect Theory' by Kahneman and Tversky.

Our intuitions being a primary source of probability evaluations is quite vexed (as you put it), since our intuitions demonstrably contain no untrained competence in evaluating phenomena subject to regression to the mean and sample size effects, also not Bayes theorem. This isn't too surprising, as any field of science doubtlessly has many phenomena which will not have untrained competence regarding them.

In my view, if there is a conflict of the intuition with something that is already unambiguously formalised, go with the formalisation.
• 712
I read a few things on likelihoodism and other ideas of what is the 'right way' to show that data favours a hypothesis against a (set of) competing hypotheses.

I am sorry, my statistics and hypothesis testing background is too basic and rusty to fully appreciate your comments. I didn't mean to advocate likelihoodism though - I only mentioned it as an example of Bayesians not being satisfied with prior probabilities and seeking ways to avoid them while still preserving what they think are Bayesianism's advantages.

In my view, if there is a conflict of the intuition with something that is already unambiguously formalised, go with the formalisation.

While Bayesianism may be an inadequate model of human cognition in every respect, or even in most respects, it may still be a passable approximation on the whole, and a good local approximation, in an asymptotic sense. AFAIK Bayesian models have shown some promise in cognitive science and neuroscience, and of course they have been widely used in machine learning - although the latter cannot be considered as strong evidence in its favor, since there's still a lot of debate as to whether neural network AI approaches are on the right track.
• 1.9k

I am sorry, my statistics and hypothesis testing background is too basic and rusty to fully appreciate your comments. I didn't mean to advocate likelihoodism though - I only mentioned it as an example of Bayesians not being satisfied with prior probabilities and seeking ways to avoid them while still preserving what they think are Bayesianism's advantages. — Sophisticat

The thrust of the comments is that contemporary statistics uses plenty of methods and mathematical objects that are not consistent with contemporary philosophy of statistics' accounts of evidential content and the methods and objects used to analyse it. One response would be 'so much the worse for statistics', but I think it's so much the worse for philosophy of statistics since these methods observably work.

I think whether Bayesian models of the mind or of learning in general are accurate in principle is mostly orthogonal to interpretations of probability. Would be worth another thread though.
• 3.1k
Without arguing which is better, this should hopefully clear up (to some degree) my disagreement with Jeremiah and perhaps provide something interesting to think about for the mathematically inclined.

Well presented. I've spent some time trying to understand and use what you call the frequentist interpretation. It's pretty consistent with my everyday understanding of how likely something is to happen. I've done a little reading on Bayesian probabilities also, but not enough to really get a feel for it.

In what types of situations would you use one rather than the other?
• 1.9k
There's usually a Bayesian or frequentist method to do anything. The major reasons people choose Bayesian or frequentist methods are, afaik, pragmatic: more to do with the availability of software and the speed of algorithms than anything philosophically fundamental.

I can say I wouldn't use frequentist estimates for problems with spatial dependence though - they take a long time algorithmically -, whereas there are very efficient Bayesian methods for it.
• 712
The thrust of the comments is that contemporary statistics uses plenty of methods and mathematical objects that are not consistent with contemporary philosophy of statistics' accounts of evidential content and the methods and objects used to analyse it. One response would be 'so much the worse for statistics', but I think it's so much the worse for philosophy of statistics since these methods observably work.

If philosophers are not current with their subject, I would say so much the worse for philosophers. I can only hope that things aren't quite as bad as you say.

I think whether Bayesian models of the mind or of learning in general are accurate in principle is mostly orthogonal to interpretations of probability. Would be worth another thread though.

Well, isn't the entire thrust of the Bayesian (aka epistemic) interpretation to psychologize probability?
• 7.1k
I think the discrepancy in interpretations lends itself due to the possibility of hidden variables. Is this something that is considered in probability theory because it goes to the heart of the issue in my opinion?
• 1.9k
@SophistiCat

Well, isn't the entire thrust of the Bayesian (aka epistemic) interpretation to psychologize probability?

I think so. But I don't think this accounts for whether Bayesian approaches to AI and the mind are correct or not. In my view AI questions about Bayesian methods are 'does this statistical model learn in the same way humans do?' or 'is this statistical model something like what a conscious mind would do?', but epistemic questions are 'does this interpretation of probability make sense of how probability is used?' and 'does (list of properties of Bayesian inference) give a good normative account of how we ought to reason?'. I can certainly see why AI questions would influence epistemic questions and vice versa, but there are definitely significant problems that affect one and not the other. For example, arguments about the likelihood principle (evidential claims must depend on a likelihood in some manner) are largely epistemic, but arguments about the framing problem (problems of parametrisation in learning algorithms) largely concern AI.

@Posty McPostface

I think the discrepancy in interpretations lends itself due to the possibility of hidden variables. Is this something that is considered in probability theory because it goes to the heart of the issue in my opinion?

I don't know what you mean, can you throw some more words at me please?
• 2k
If you're a strict Bayesian the vast majority of applied research is bogus (since it uses hypothesis tests).
Can you elaborate a little please? Are you suggesting that hypothesis tests are always invalid under a strict Bayesian approach, or only that the vast majority of them are?

I ask the question because I am thinking about a hypothesis that the proportion of red balls in a large barrel containing a finite number of identically-sized red or green balls is no greater than 0.1, and then using a sample of say thirty balls to test that hypothesis and make a statement about the confidence that the hypothesis is correct. It's not clear to me that that exercise requires one to choose between Bayesian and Frequentist interpretations in order to be valid.
• 1.9k
The typical method used for hypothesis testing using p-values is invalid if you are strictly Bayesian.
• 2k
I'm afraid I don't know what you mean by typical method. If I were to use binomial distributions to calculate a level of confidence (a p-value) that the proportion of red balls in the above case is no greater than 0.1, based on an observed sample of thirty balls, would you call that typical? If so, why would a strict Bayesian consider it invalid?
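For concreteness, the binomial calculation described here can be sketched in Python as follows. The count of 7 red balls is a hypothetical observation, and the binomial model assumes the barrel is large enough that drawing thirty balls barely changes the proportion (for a small barrel a hypergeometric calculation would be the closer fit).

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_sampled = 30
reds_observed = 7          # hypothetical observation, for illustration only
null_proportion = 0.1      # boundary of the hypothesis 'proportion is no greater than 0.1'

# Probability of seeing at least this many reds if the proportion were exactly 0.1.
p_value = binomial_upper_tail(reds_observed, n_sampled, null_proportion)
print(f"p-value = {p_value:.4f}")
```

The number is computed on the assumption that the proportion equals 0.1, so it describes how surprising the sample would be under that assumption rather than how probable the hypothesis itself is.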
• 1.9k
The procedure for going from a null hypothesis and an alternative hypothesis to a p-value with its usual interpretation is what being a strict Bayesian precludes. There are a couple of direct contradictions, firstly that the population parameter in the null is fixed - in Bayes it's random. Also the interpretation of a p-value typically (as is the case for t, Z, F and Chi-square tests) relies on the frequentist interpretation of probability - long run frequency.
• 2k
This is not consistent with my understanding. But perhaps the apparent conflict lies in words like 'usual interpretation'.

If I have a null hypothesis that a population parameter mu has value q and calculate that, conditional on the null hypothesis being true, the probability of statistic S from a randomly chosen sample of n elements having a value in set A is p, are you saying that making that statement is inconsistent with a Bayesian view?

Or are you saying that to go from there to a statement about the probability that the population parameter mu lies in some set U is inconsistent with a Bayesian view?

I am generally uncomfortable with, and tend to avoid, statements of the second kind, so I expect we probably agree, if that's what you meant.
• 1.9k

If I have a null hypothesis that a population parameter mu has value q and calculate that, conditional on the null hypothesis being true, the probability of statistic S from a randomly chosen sample of n elements having a value in set A is p, are you saying that making that statement is inconsistent with a Bayesian view?

You can look at the probability of a random variable belonging to a set, you can't look at the probability of a fixed quantity belonging to a set (it's 0 or 1 depending on the interval and value). Also, the p-value assumes that the null hypothesis is true, it isn't a probability estimate of the null hypothesis. The important differences are what is considered random and how it's dealt with.

It's also possible to define p-value like objects for Bayesian analysis, but they are based on random parameters - so the likelihood actually is a probability distribution every time. This observation can also give an inconsistency between Bayesians and frequentists, and is termed a likelihood principle. Lots of hypothesis tests don't satisfy this principle.
• 1.9k
Another contrast I thought of is that in Bayesian methods, the data are considered as fixed quantities in the likelihood and hypothesis tests are done using the product of the likelihood (actually a conditional distribution of the model parameters given the data here) and the prior - the posterior. There isn't an asymptotic distribution of the test statistic, there's simply some ratio involving priors and likelihoods.
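As a rough illustration of treating the data as fixed and the parameter as random, here is a Python sketch for the red-ball example from earlier in the thread. The assumptions are mine for illustration: 7 red balls observed in a sample of 30, a binomial likelihood, and a uniform prior on the proportion. The posterior is approximated by simple midpoint-rule integration over a grid.

```python
def posterior_prob_at_most(threshold, reds, n, grid_size=100_000):
    """P(proportion <= threshold | data) under a uniform prior on the proportion.

    The unnormalised posterior density is likelihood x prior = p^reds * (1 - p)^(n - reds);
    it is integrated numerically with the midpoint rule.
    """
    step = 1.0 / grid_size
    below = 0.0
    total = 0.0
    for i in range(grid_size):
        p = (i + 0.5) * step
        density = p**reds * (1.0 - p) ** (n - reds)
        total += density
        if p <= threshold:
            below += density
    return below / total

# Hypothetical data: 7 reds in a sample of 30.
prob_hypothesis = posterior_prob_at_most(0.1, reds=7, n=30)
print(f"P(proportion <= 0.1 | data) = {prob_hypothesis:.4f}")
```

Unlike a p-value, this is a direct probability statement about the parameter, but it is only as good as the prior: a different prior on the proportion gives a different answer.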
• 712
I think so. But I don't think this accounts for whether Bayesian approaches to AI and the mind are correct or not. In my view AI questions about Bayesian methods are 'does this statistical model learn in the same way humans do?' or 'is this statistical model something like what a conscious mind would do?', but epistemic questions are 'does this interpretation of probability make sense of how probability is used?' and 'does (list of properties of Bayesian inference) give a good normative account of how we ought to reason?'.

Well, if Bayesian probability is supposed to model our reasoning, then there is an obvious connection between Bayesian models and AI, if the idea is for AI to emulate human reasoning.

But does Bayesian probability describe reasoning or prescribe reasoning? It seems to want to do both.
• 1.9k

I haven't got a scooby.
• 7.6k
Let's say I flip a fair coin and randomly select a card from an ordinary deck. I tell you truthfully that either the coin landed heads up or that the card is the Ace of Spades. If you can guess correctly which it is then I will give you £10.

From what I recall from past arguments (with @aletheist, I believe), the frequentist can only say that the probability that the coin landed heads up is either 0 or 1 and that the probability that the card is the Ace of Spades is either 0 or 1. Therefore, it cannot help you to determine which is the better guess.

However, the Bayesian can say that the probability that the coin landed heads up is 1/2 and that the probability that the card is the Ace of Spades is 1/52, showing that it is better to guess that the coin landed heads up.
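The arithmetic can be checked with exact fractions in Python. This sketch makes two assumptions that go beyond the post: the coin and the card are independent, and the statement 'heads or Ace of Spades' would have been made whenever it was true. Under those assumptions the probabilities can also be conditioned on the statement being true, which sharpens the comparison without changing which guess is better.

```python
from fractions import Fraction

p_heads = Fraction(1, 2)
p_ace = Fraction(1, 52)

# Probability that at least one of the two events happened (independence assumed).
p_either = 1 - (1 - p_heads) * (1 - p_ace)

# Condition each event on the truthful statement 'heads or Ace of Spades'.
p_heads_given_clue = p_heads / p_either
p_ace_given_clue = p_ace / p_either

print(p_heads_given_clue)  # 52/53
print(p_ace_given_clue)    # 2/53
```

The two conditional probabilities add up to more than 1 because both events can be true at once.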