class: left, bottom, my-title, title-slide
.title[
# Bayes factors
]
.subtitle[
## Who's worried, who's not, and why
]
.author[
### Richard D. Morey
]
.date[
### 29 March 2023
]

---

<style>
.faded{ opacity: .3; }
.red{ color: red; }
</style>

# An alternative universe

<center>
<img src="img/time-machine.jpg" style="width: 70%;"/><br/><br/>
A Bayesian discovers time travel
</center>

---

# Back to the future...

<center>
<img src="img/alt-rep-crisis2.png" style="width: 70%; border: 2px solid black; box-shadow: 5px 10px 18px #888888;"/><br/><br/>
Replication crisis!
</center>

---

# "Bayes factors overstate the evidence"

* "If the null is true, we can easily find BF>10!"
* "Sequential testing is standard"
* "Opportunistic one-sided tests are standard"
* "Multiple testing, without correction, is standard"
* "Bayesian point nulls have been abandoned (implausible!)"
* "Solution: Frequentist statistics"

<br/>
<center>
<img src="img/fix-rep-crisis.png" style="width: 70%; border: 2px solid black; box-shadow: 5px 10px 18px #888888;"/><br/><br/>
"p values: the new statistic everyone is talking about"
</center>

---

# Bayes factors in our universe

<br/><br/>

* Generalization of likelihood ratios
* Increasingly popular statistics for testing
* Used by some to "calibrate" `\(p\)` values

### But...they are *controversial* even among Bayesians.

---

# Opinions about Bayes factors

> "Bayes factors are the primary tool used in Bayesian inference for hypothesis testing and model selection...their role in Bayesian analysis is indisputable." — [Berger (2006)](https://onlinelibrary.wiley.com/doi/book/10.1002/0471667196)

---

# Opinions about Bayes factors

> "...I see [the Bayes factor] as a child of its time, namely, as impacted by the on-going formalisation of testing by other pioneers like Jerzy Neyman or Egon Pearson. Returning a single quantity for the comparison of two models fits naturally in decision making, but I strongly feel in favour of the alternative route that Bayesian model comparison should abstain from automated and hard decision making." — [Robert (2016)](https://www.sciencedirect.com/science/article/abs/pii/S0022249615000504)

---

# Opinions about Bayes factors

> "In our experience, [statistical practice] is about plots and predictive checks, not about Bayes factors or posterior probabilities of candidate models." — Gelman [(2013)](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2044-8317.2011.02037.x)

> "I generally hate Bayes factors myself..." — Gelman [(2017)](https://statmodeling.stat.columbia.edu/2017/07/21/bayes-factor-term-came-references-generally-hate/)

---

# Bayesian<sup>*</sup> critiques of Bayes factors

* Sensitivity to the prior model
* Boils the assessment down to a single value
* Often associated with model *selection*
* Often associated with (assumed improbable) point nulls

.footnote[<sup>*</sup>(Frequentists, of course, have access to all the usual frequentist critiques of Bayesian statistics.)]

---

# Statistical evidence

.pull-left[
### Frequentist evidence

* Evidence *against*
* Primitive: probability as "decision" error
* **Strong** when weaker evidence would be almost certain, were the hypothesis true
* Ideas: error control, sensitivity
]

.pull-right[
### Bayesian evidence

* *Weight/balance* of evidence
* Primitive: probability as "beliefs"
* **Strong** when the data shift beliefs substantially toward one model over the other
* Ideas: convincingness, relative odds change
]

---

# What is a Bayes factor?

### A Bayes factor is a change in relative odds (belief) due to the data

$$
\frac{ p({\cal M}_1\mid\boldsymbol y)}{p({\cal M}_2\mid\boldsymbol y)} = \frac{p(\boldsymbol y\mid {\cal M}_1)}{p(\boldsymbol y\mid {\cal M}_2)}\times\frac{p({\cal M}_1)}{p({\cal M}_2)}
$$

<br/>
<div>
$$
\mbox{Posterior odds} = \mbox{Evidence} \times \mbox{Prior odds}\phantom{\int_{\boldsymbol\Theta}}
$$
</div>

<br/>
<br/>

"The Bayes factor is the shift in the odds due to the data."

---

# What is a Bayes factor?

### A Bayes factor is a change in relative odds (belief) due to the data

$$
\frac{ p({\cal M}_1\mid\boldsymbol y)}{p({\cal M}_2\mid\boldsymbol y)} = \frac{p(\boldsymbol y\mid {\cal M}_1)}{p(\boldsymbol y\mid {\cal M}_2)}\times\frac{p({\cal M}_1)}{p({\cal M}_2)}
$$

<br/>
<div>
$$
p(\boldsymbol y\mid{\cal M}_i) = \int_{\boldsymbol\Theta} p_i(\boldsymbol y\mid\boldsymbol \theta)p_i(\boldsymbol \theta)\, d\boldsymbol \theta
$$
</div>

<br/>
<br/>

"The Bayes factor is the ratio of the average likelihoods under the models."
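---

# The odds form, numerically

A minimal sketch of the update rule above, in R; the prior odds and Bayes factor here are illustrative numbers, not values from this talk.

```r
# Posterior odds = Bayes factor (evidence) x prior odds
prior_odds <- 1                  # models equally plausible a priori
bayes_factor <- 5                # hypothetical: data favor M1 five-to-one
posterior_odds <- bayes_factor * prior_odds

# Convert posterior odds to a posterior probability for M1
posterior_odds / (1 + posterior_odds)   # 5/6, about 0.83
```

Different prior odds lead to different posterior odds; the data fix only the *shift* — the Bayes factor.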
---

# An example

### Extrasensory perception

<br/>
<br/>

* A "future-seer" ability is to be tested.
* Before each of 100 coin flips, they predict the outcome (heads or tails).
* Of interest:
  * `\(y\)`: the number of correct predictions
  * `\(\theta\)`: the "true" probability of a correct prediction
* To test: `\(\theta=0.5\)` vs `\(\theta>0.5\)`
* Suppose they predict `\(y=61\)` flips correctly.

---

# Binomial probability

<img src="index_files/figure-html/unnamed-chunk-2-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Binomial probability

<img src="index_files/figure-html/unnamed-chunk-3-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Binomial probability

<img src="index_files/figure-html/unnamed-chunk-4-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Classical one-sided p value

<img src="index_files/figure-html/unnamed-chunk-5-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Another possibility

<img src="index_files/figure-html/unnamed-chunk-6-1.svg" width="75%" style="display: block; margin: auto;" />

---

# A simple Bayes factor

<img src="index_files/figure-html/unnamed-chunk-7-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Another possibility

<img src="index_files/figure-html/unnamed-chunk-8-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Likelihood (and ratio)

<img src="index_files/figure-html/unnamed-chunk-9-1.svg" width="75%" style="display: block; margin: auto;" />
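---

# In R: likelihoods and the one-sided p value

A sketch of the quantities behind the last few slides, for `\(y=61\)` correct predictions out of `\(n=100\)` flips.

```r
y <- 61
n <- 100

# Likelihood at the null value and at the best-supported point alternative
like_null <- dbinom(y, n, prob = 0.5)      # about 0.007
like_mle  <- dbinom(y, n, prob = y / n)    # likelihood at theta-hat = 0.61
like_mle / like_null                       # a simple likelihood ratio

# Classical one-sided p value: P(Y >= 61) when theta = 0.5
sum(dbinom(y:n, n, prob = 0.5))            # about 0.018
```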
---

# What do we want to compare?

.pull-left[
### Model 0 ("Null")

Model 0 is simple.

$$
{\cal M}_0: \theta = 0.5
$$

$$
p(y=61\mid{\cal M}_0) = 0.007
$$
]

.pull-right[
### Model 1 ("Alternative")

Model 1 is composite!

$$
{\cal M}_1: \theta > 0.5
$$

Need the *average* likelihood:

<div>
$$
p(y=61\mid{\cal M}_1) = \int_{.5}^1 p(y=61\mid\theta)p(\theta)d\theta
$$
</div>
]

---

# A prior distribution

<img src="index_files/figure-html/unnamed-chunk-11-1.svg" width="75%" style="display: block; margin: auto;" />

---

# The average likelihood

<img src="index_files/figure-html/unnamed-chunk-12-1.svg" width="75%" style="display: block; margin: auto;" />

---

# The average likelihood

<img src="index_files/figure-html/unnamed-chunk-13-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Computing a posterior: multiplying curves

<img src="index_files/figure-html/unnamed-chunk-14-1.svg" width="95%" style="display: block; margin: auto;" />

---

# From prior to posterior

<img src="index_files/figure-html/unnamed-chunk-15-1.svg" width="75%" style="display: block; margin: auto;" />

---

# From prior to posterior

<img src="index_files/figure-html/unnamed-chunk-16-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Computing a Bayes factor

.faded[
.pull-left[
### Model 0 ("Null")

Model 0 is simple.

$$
{\cal M}_0: \theta = 0.5
$$

$$
p(y=61\mid{\cal M}_0) = 0.007
$$
]
]

.pull-right[
### Model 1 ("Alternative")

Model 1 is composite!

$$
{\cal M}_1: \theta > 0.5
$$

Need the *average* likelihood:

<div>
$$
`\begin{eqnarray*}
p(y=61\mid{\cal M}_1) &=& \int_{.5}^1 p(y=61\mid\theta)p(\theta)d\theta\\
&=& 0.039
\end{eqnarray*}`
$$
</div>
]

<center>
Bayes factor vs "null": `\(0.039/0.007 = 5.5\)`
</center>
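---

# In R: averaging the likelihood

A sketch of the marginal (average) likelihood calculation above. The prior here — uniform on `\((0.5, 1)\)` — is a stand-in, not the prior pictured in the earlier slides, so the resulting numbers will differ from 0.039 and 5.5.

```r
y <- 61
n <- 100

# Average likelihood under M1: integrate likelihood x prior density.
# Stand-in prior: uniform on (0.5, 1), which has density 2 there.
avg_like_m1 <- integrate(function(theta) dbinom(y, n, theta) * 2,
                         lower = 0.5, upper = 1)$value

# Likelihood under the point null M0
like_m0 <- dbinom(y, n, prob = 0.5)

avg_like_m1 / like_m0   # Bayes factor for M1 over M0, under this prior
```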
---

# Interpreting the Bayes factor

### What does the value mean?

| Bayes factor | Jeffreys interpretation |
|--------------|-------------------------|
| 1-3          | "Not worth mentioning"  |
| 3-10         | "Substantial"           |
| 10-30        | "Strong"                |
| 30-100       | "Very strong"           |
| >100         | "Decisive"              |

<br/>

Who gets to decide this?

---

# Average likelihood for all outcomes

<img src="index_files/figure-html/unnamed-chunk-18-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Another prior

<img src="index_files/figure-html/unnamed-chunk-19-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Another prior: average likelihood

<img src="index_files/figure-html/unnamed-chunk-20-1.svg" width="75%" style="display: block; margin: auto;" />

---

# One-sided p values vs Bayes factors

<img src="index_files/figure-html/unnamed-chunk-22-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Prior "robustness"

<img src="index_files/figure-html/unnamed-chunk-23-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Problematic p values?

<img src="index_files/figure-html/unnamed-chunk-24-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Facts about Bayes factors

<br/>

* Can be computed whenever `\(p(y\mid{\cal M})\)` is available
* Sensitive to the prior model `\(p(\boldsymbol\theta)\)`
* Insensitive to optional stopping
* Bounded *under some conditions*
* Insensitive to prior odds...
* ...hence, insensitive to multiple tests

---

# Prior sensitivity?

### Possible reactions:

* Abandon
  * Unprincipled?
* Accept
  * Is this possible?
* Choose "defaults"...
  * ...but lose arguments for Bayes
* Restrict to *classes* of priors
  * e.g.: "The Bayes factor will not exceed X..."
  * Doesn't solve the problem

---

# Prior sensitivity

<br/>
<br/>

Prior sensitivity in testing *isn't just about* the numbers being slightly different.

<br/>
<br/>

It's fundamentally about what questions we're trying to answer, and how.

---

# Redefine statistical significance (RS)?

<center>
<img src="img/RSS_2017.png" style="width: 45%;"/>
</center>

---

# RS primary arguments

> "We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries."

1. `\(p\approx.05\)` represents "weak [Bayesian] evidence"
2. `\(p<.05\)` inflates the "false discovery rate"
3. `\(p<.05\)` results are less likely to "replicate" than `\(p<.005\)` results

➡ Using `\(p<.05\)` is a "leading cause of non-reproducibility"

---

## Priors/models

<center>
<img src="index_files/figure-html/unnamed-chunk-25-1.svg" width="60%" style="display: block; margin: auto;" />
</center>

---

# "Calibrating" p values

### The argument

* In fixed `\(N\)` designs,
* ...with some default ("reasonable") prior...
* ...or any prior in a ("reasonable") class of priors...
* ...comparing a (two-sided) alternative...
* ...against a point null...
* ...the Bayes factor can be "small"...
* ...while the `\(p\)` value is `\(<0.05\)`.

---

# RS Bayes factor approaches

Point null vs two-sided alternatives:

### Likelihood
* ...best point alternatives (maximum likelihood)

### Johnson
* ...point alternatives with specified "power"

### Berger
* ...best case among a class of priors

---

# What are we testing?

|                  | Test                 | Bayes factor     |
|------------------|----------------------|------------------|
| Point vs 1-sided | >.5, vs null         | 5.5              |
| Dividing         | >.5, vs ≤ .5         | <sup>†</sup>32.4 |
| Point vs 2-sided | >.5 OR <.5, vs null  | <sup>††</sup>2.8 |

<br/>

|                  | Test                       | p value |
|------------------|----------------------------|---------|
| Point vs 1-sided | Above .5, vs null          | 0.018   |
| Dividing         | Above .5, vs below .5      | 0.018   |
| Point vs 2-sided | Above OR below .5, vs null | 0.035   |

.footnote[<sup>†</sup> Assuming priors for values <.5 are the mirror images of those >.5.<br/><sup>††</sup> Assuming equal weights on each side.]

---

# Penalizing the evidence

How do these practices affect the evidence?

| Practice                          | p value         | Bayes factor |
|-----------------------------------|-----------------|--------------|
| Choosing sign                     | .red[penalized] | no effect    |
| Data peeking / sequential designs | .red[penalized] | no effect    |
| Multiple hypotheses               | .red[penalized] | no effect    |

* The dividing Bayes factor is *unbounded* under sequential sampling, *even under the null*.
* The `\(p\)` value would be penalized (see the simulation on the next slide).
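---

# Data peeking, simulated

A minimal simulation (not from the talk) of the data-peeking row above: testing at `\(\alpha = .05\)` after every 10 flips of a fair coin, and stopping at the first "significant" result, inflates the error rate well above 5%.

```r
set.seed(1)

peek_rejects <- function(n_max = 100, peek_every = 10) {
  flips <- rbinom(n_max, size = 1, prob = 0.5)   # the null is true
  for (n in seq(peek_every, n_max, by = peek_every)) {
    p <- binom.test(sum(flips[1:n]), n, p = 0.5)$p.value
    if (p < 0.05) return(TRUE)                   # stop and "reject"
  }
  FALSE
}

mean(replicate(10000, peek_rejects()))  # noticeably above 0.05
```

The Bayes factor computed at the stopping point is the same as if `\(n\)` had been fixed — which is exactly what makes its insensitivity both attractive and contested.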
---

# What about those RS arguments?

### The argument's assumptions are not essential to Bayes factors.

* Assume fixed designs
  * Won't reveal when BFs appear to "overstate" evidence
* Point null
  * Reduces evidence for the alternative
* Two-sided priors
  * Reduce evidence for the alternative

---

# Arguments for Bayes factors undercut RS!

<br/>

* "Dependence on the prior is good flexibility, not a problem"
  * Then there is no *general* calibration against p values
* "Multiple testing can be handled with prior odds"
  * But it is always argued that the prior odds don't matter
* "p values are sensitive to optional stopping, BFs aren't"
  * Then the argument should reflect that!

---

# Wrap-up

* Bayes factors: statistical evidence, from a Bayesian perspective
* Bayesian arguments cut two ways:
  * Flexibility: no general `\(p\)`/BF calibration
  * Insensitivity: BFs can "overstate" relative to `\(p\)`

<br/><br/>

> "I do deal with likelihood function and, occasionally, calculate the maximum likelihood estimators. However, I do so not as a matter of principle, but only in those cases when the frequency properties of the estimators fit my purposes." — Neyman, 1977

---

# My own thoughts

* Frequentist/Bayesian are useless labels
  * More variance within than between!
* A good statistician's intuitions are both "Bayesian" and "frequentist"
  * How often could I be misled in this idealized situation?
  * How would idealized beliefs change?
* Reducing crime by making it legal is a questionable approach.