class: left, bottom, my-title, title-slide
.title[
# Bayes factors
]
.subtitle[
## Who's worried, who's not, and why
]
.author[
### Richard D. Morey
]
.date[
### 29 March 2023
]

---

<style>
.faded{ opacity: .3; }
.red{ color: red; }
</style>

# An alternative universe

<center>
<img src="img/time-machine.jpg" style="width: 70%;"/><br/><br/>
A Bayesian discovers time travel
</center>

---

# Back to the future...

<center>
<img src="img/alt-rep-crisis2.png" style="width: 70%; border: 2px solid black; box-shadow: 5px 10px 18px #888888;"/><br/><br/>
Replication crisis!
</center>

---

# "Bayes factors overstate the evidence"

* "If the null is true, we can easily find BF>10!"
* "Sequential testing is standard"
* "Opportunistic one-sided tests are standard"
* "Multiple testing, without correction, is standard"
* "Bayesian point nulls have been abandoned (implausible!)"
* "Solution: Frequentist statistics"

<br/>
<center>
<img src="img/fix-rep-crisis.png" style="width: 70%; border: 2px solid black; box-shadow: 5px 10px 18px #888888;"/><br/><br/>
"p values: the new statistic everyone is talking about"
</center>

---

# Bayes factors in our universe

<br/><br/>

* Generalization of likelihood ratios
* Increasingly popular statistics for testing
* Used by some to "calibrate" `\(p\)` values

### But...they are *controversial* even among Bayesians.

---

# Opinions about Bayes factors

> "Bayes factors are the primary tool used in Bayesian inference for hypothesis testing and model selection...their role in Bayesian analysis is indisputable." — [Berger (2006)](https://onlinelibrary.wiley.com/doi/book/10.1002/0471667196)

---

# Opinions about Bayes factors

> "...I see [the Bayes factor] as a child of its time, namely, as impacted by the on-going formalisation of testing by other pioneers like Jerzy Neyman or Egon Pearson. Returning a single quantity for the comparison of two models fits naturally in decision making, but I strongly feel in favour of the alternative route that Bayesian model comparison should abstain from automated and hard decision making." — [Robert (2016)](https://www.sciencedirect.com/science/article/abs/pii/S0022249615000504)

---

# Opinions about Bayes factors

> "In our experience, [statistical practice] is about plots and predictive checks, not about Bayes factors or posterior probabilities of candidate models." — Gelman [(2013)](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2044-8317.2011.02037.x)

> "I generally hate Bayes factors myself..." — Gelman [(2017)](https://statmodeling.stat.columbia.edu/2017/07/21/bayes-factor-term-came-references-generally-hate/)

---

# Bayesian<sup>*</sup> critiques of Bayes factors

* Sensitivity to the prior model
* Boils the assessment down to a single value
* Often associated with model *selection*
* Often associated with (assumed improbable) point nulls

.footnote[<sup>*</sup>(Frequentists, of course, have access to all the usual frequentist critiques of Bayesian statistics.)]

---

# Statistical evidence

.pull-left[
### Frequentist evidence

* Evidence *against*
* Primitive: probability as "decision" error
* **Strong** when weaker evidence would be almost certain, were the hypothesis true
* Ideas: error control, sensitivity
]

.pull-right[
### Bayesian evidence

* *Weight/balance* of evidence
* Primitive: probability as "beliefs"
* **Strong** when the data shift beliefs substantially toward one model over the other
* Ideas: convincingness, relative odds change
]

---

# What is a Bayes factor?

### A Bayes factor is a change in relative odds (belief) due to the data

$$
\frac{ p({\cal M}_1\mid\boldsymbol y)}{p({\cal M}_2\mid\boldsymbol y)} = \frac{p(\boldsymbol y\mid {\cal M}_1)}{p(\boldsymbol y\mid {\cal M}_2)}\times\frac{p({\cal M}_1)}{p({\cal M}_2)}
$$

<br/>
<div>
$$
\mbox{Posterior odds} = \mbox{Evidence} \times \mbox{Prior odds}\phantom{\int_{\boldsymbol\Theta}}
$$
</div>

<br/>
<br/>

"The Bayes factor is the shift in the odds due to the data."

---

# What is a Bayes factor?

### A Bayes factor is a change in relative odds (belief) due to the data

$$
\frac{ p({\cal M}_1\mid\boldsymbol y)}{p({\cal M}_2\mid\boldsymbol y)} = \frac{p(\boldsymbol y\mid {\cal M}_1)}{p(\boldsymbol y\mid {\cal M}_2)}\times\frac{p({\cal M}_1)}{p({\cal M}_2)}
$$

<br/>
<div>
$$
p(\boldsymbol y\mid{\cal M}_i) = \int_{\boldsymbol\Theta} p_i(\boldsymbol y\mid\boldsymbol \theta)p_i(\boldsymbol \theta)\, d\boldsymbol \theta
$$
</div>

<br/>
<br/>

"The Bayes factor is the ratio of the average likelihoods under the models."
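---

# The odds form, numerically

A minimal sketch of the update rule above, in R; the prior odds and Bayes factor here are illustrative numbers, not values from this talk.

```r
# Posterior odds = Bayes factor (evidence) x prior odds
prior_odds <- 1                  # models equally plausible a priori
bayes_factor <- 5                # hypothetical: data favor M1 five-to-one
posterior_odds <- bayes_factor * prior_odds

# Convert posterior odds to a posterior probability for M1
posterior_odds / (1 + posterior_odds)   # 5/6, about 0.83
```

Different prior odds lead to different posterior odds; the data fix only the *shift* — the Bayes factor.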
---

# An example

### Extrasensory perception

<br/>
<br/>

* A "future-seer" ability is to be tested.
* Before each of 100 coin flips, they predict the outcome (heads or tails).
* Of interest:
  * `\(y\)`: the number of correct predictions
  * `\(\theta\)`: the "true" probability of a correct prediction
* To test: `\(\theta=0.5\)` vs `\(\theta>0.5\)`
* Suppose they predict `\(y=61\)` flips correctly.

---

# Binomial probability

<img src="index_files/figure-html/unnamed-chunk-2-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Binomial probability

<img src="index_files/figure-html/unnamed-chunk-3-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Binomial probability

<img src="index_files/figure-html/unnamed-chunk-4-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Classical one-sided p value

<img src="index_files/figure-html/unnamed-chunk-5-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Another possibility

<img src="index_files/figure-html/unnamed-chunk-6-1.svg" width="75%" style="display: block; margin: auto;" />

---

# A simple Bayes factor

<img src="index_files/figure-html/unnamed-chunk-7-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Another possibility

<img src="index_files/figure-html/unnamed-chunk-8-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Likelihood (and ratio)

<img src="index_files/figure-html/unnamed-chunk-9-1.svg" width="75%" style="display: block; margin: auto;" />
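---

# In R: likelihoods and the one-sided p value

A sketch of the quantities behind the last few slides, for `\(y=61\)` correct predictions out of `\(n=100\)` flips.

```r
y <- 61
n <- 100

# Likelihood at the null value and at the best-supported point alternative
like_null <- dbinom(y, n, prob = 0.5)      # about 0.007
like_mle  <- dbinom(y, n, prob = y / n)    # likelihood at theta-hat = 0.61
like_mle / like_null                       # a simple likelihood ratio

# Classical one-sided p value: P(Y >= 61) when theta = 0.5
sum(dbinom(y:n, n, prob = 0.5))            # about 0.018
```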
---

# What do we want to compare?

.pull-left[
### Model 0 ("Null")

Model 0 is simple.

$$
{\cal M}_0: \theta = 0.5
$$

$$
p(y=61\mid{\cal M}_0) = 0.007
$$
]

.pull-right[
### Model 1 ("Alternative")

Model 1 is composite!

$$
{\cal M}_1: \theta > 0.5
$$

Need the *average* likelihood:

<div>
$$
p(y=61\mid{\cal M}_1) = \int_{.5}^1 p(y=61\mid\theta)p(\theta)d\theta
$$
</div>
]

---

# A prior distribution

<img src="index_files/figure-html/unnamed-chunk-11-1.svg" width="75%" style="display: block; margin: auto;" />

---

# The average likelihood

<img src="index_files/figure-html/unnamed-chunk-12-1.svg" width="75%" style="display: block; margin: auto;" />

---

# The average likelihood

<img src="index_files/figure-html/unnamed-chunk-13-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Computing a posterior: multiplying curves

<img src="index_files/figure-html/unnamed-chunk-14-1.svg" width="95%" style="display: block; margin: auto;" />

---

# From prior to posterior

<img src="index_files/figure-html/unnamed-chunk-15-1.svg" width="75%" style="display: block; margin: auto;" />

---

# From prior to posterior

<img src="index_files/figure-html/unnamed-chunk-16-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Computing a Bayes factor

.faded[
.pull-left[
### Model 0 ("Null")

Model 0 is simple.

$$
{\cal M}_0: \theta = 0.5
$$

$$
p(y=61\mid{\cal M}_0) = 0.007
$$
]
]

.pull-right[
### Model 1 ("Alternative")

Model 1 is composite!

$$
{\cal M}_1: \theta > 0.5
$$

Need the *average* likelihood:

<div>
$$
`\begin{eqnarray*}
p(y=61\mid{\cal M}_1) &=& \int_{.5}^1 p(y=61\mid\theta)p(\theta)d\theta\\
&=& 0.039
\end{eqnarray*}`
$$
</div>
]

<center>
Bayes factor vs "null": `\(0.039/0.007 = 5.5\)`
</center>
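---

# In R: averaging the likelihood

A sketch of the marginal (average) likelihood calculation above. The prior here — uniform on `\((0.5, 1)\)` — is a stand-in, not the prior pictured in the earlier slides, so the resulting numbers will differ from 0.039 and 5.5.

```r
y <- 61
n <- 100

# Average likelihood under M1: integrate likelihood x prior density.
# Stand-in prior: uniform on (0.5, 1), which has density 2 there.
avg_like_m1 <- integrate(function(theta) dbinom(y, n, theta) * 2,
                         lower = 0.5, upper = 1)$value

# Likelihood under the point null M0
like_m0 <- dbinom(y, n, prob = 0.5)

avg_like_m1 / like_m0   # Bayes factor for M1 over M0, under this prior
```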
---

# Interpreting the Bayes factor

### What does the value mean?

| Bayes factor | Jeffreys interpretation |
|--------------|-------------------------|
| 1-3          | "Not worth mentioning"  |
| 3-10         | "Substantial"           |
| 10-30        | "Strong"                |
| 30-100       | "Very strong"           |
| >100         | "Decisive"              |

<br/>

Who gets to decide this?

---

# Average likelihood for all outcomes

<img src="index_files/figure-html/unnamed-chunk-18-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Another prior

<img src="index_files/figure-html/unnamed-chunk-19-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Another prior: average likelihood

<img src="index_files/figure-html/unnamed-chunk-20-1.svg" width="75%" style="display: block; margin: auto;" />

---

# One-sided p values vs Bayes factors

<img src="index_files/figure-html/unnamed-chunk-22-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Prior "robustness"

<img src="index_files/figure-html/unnamed-chunk-23-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Problematic p values?

<img src="index_files/figure-html/unnamed-chunk-24-1.svg" width="75%" style="display: block; margin: auto;" />

---

# Facts about Bayes factors

<br/>

* Can be computed whenever `\(p(y\mid{\cal M})\)` is available
* Sensitive to the prior model `\(p(\boldsymbol\theta)\)`
* Insensitive to optional stopping
* Bounded *under some conditions*
* Insensitive to prior odds...
* ...hence, insensitive to multiple tests

---

# Prior sensitivity?

### Possible reactions:

* Abandon
  * Unprincipled?
* Accept
  * Is this possible?
* Choose "defaults"...
  * ...but lose arguments for Bayes
* Restrict to *classes* of priors
  * e.g.: "The Bayes factor will not exceed X..."
  * Doesn't solve the problem

---

# Prior sensitivity

<br/>
<br/>

Prior sensitivity in testing *isn't just about* the numbers being slightly different.

<br/>
<br/>

It's fundamentally about what questions we're trying to answer, and how.

---

# Redefine statistical significance (RS)?

<center>
<img src="img/RSS_2017.png" style="width: 45%;"/>
</center>

---

# RS primary arguments

> "We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries."

1. `\(p\approx.05\)` represents "weak [Bayesian] evidence"
2. `\(p<.05\)` inflates the "false discovery rate"
3. `\(p<.05\)` results are less likely to "replicate" than `\(p<.005\)` results

➡ Using `\(p<.05\)` is a "leading cause of non-reproducibility"

---

## Priors/models

<center>
<img src="index_files/figure-html/unnamed-chunk-25-1.svg" width="60%" style="display: block; margin: auto;" />
</center>

---

# "Calibrating" p values

### The argument

* In fixed `\(N\)` designs,
* ...with some default ("reasonable") prior...
* ...or any prior in a ("reasonable") class of priors...
* ...comparing a (two-sided) alternative...
* ...against a point null...
* ...the Bayes factor can be "small"...
* ...while the `\(p\)` value is `\(<0.05\)`.

---

# RS Bayes factor approaches

Point null vs two-sided alternatives:

### Likelihood
* ...best point alternatives (maximum likelihood)

### Johnson
* ...point alternatives with specified "power"

### Berger
* ...best case among a class of priors

---

# What are we testing?

|                  | Test                 | Bayes factor     |
|------------------|----------------------|------------------|
| Point vs 1-sided | >.5, vs null         | 5.5              |
| Dividing         | >.5, vs ≤ .5         | <sup>†</sup>32.4 |
| Point vs 2-sided | >.5 OR <.5, vs null  | <sup>††</sup>2.8 |

<br/>

|                  | Test                       | p value |
|------------------|----------------------------|---------|
| Point vs 1-sided | Above .5, vs null          | 0.018   |
| Dividing         | Above .5, vs below .5      | 0.018   |
| Point vs 2-sided | Above OR below .5, vs null | 0.035   |

.footnote[<sup>†</sup> Assuming priors for values <.5 are the mirror images of those >.5.<br/><sup>††</sup> Assuming equal weights on each side.]

---

# Penalizing the evidence

How do these practices affect the evidence?

| Practice                          | p value         | Bayes factor |
|-----------------------------------|-----------------|--------------|
| Choosing sign                     | .red[penalized] | no effect    |
| Data peeking / sequential designs | .red[penalized] | no effect    |
| Multiple hypotheses               | .red[penalized] | no effect    |

* The dividing Bayes factor is *unbounded* under sequential sampling, *even under the null*.
* The `\(p\)` value would be penalized (see the simulation on the next slide).
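---

# Data peeking, simulated

A minimal simulation (not from the talk) of the data-peeking row above: testing at `\(\alpha = .05\)` after every 10 flips of a fair coin, and stopping at the first "significant" result, inflates the error rate well above 5%.

```r
set.seed(1)

peek_rejects <- function(n_max = 100, peek_every = 10) {
  flips <- rbinom(n_max, size = 1, prob = 0.5)   # the null is true
  for (n in seq(peek_every, n_max, by = peek_every)) {
    p <- binom.test(sum(flips[1:n]), n, p = 0.5)$p.value
    if (p < 0.05) return(TRUE)                   # stop and "reject"
  }
  FALSE
}

mean(replicate(10000, peek_rejects()))  # noticeably above 0.05
```

The Bayes factor computed at the stopping point is the same as if `\(n\)` had been fixed — which is exactly what makes its insensitivity both attractive and contested.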
---

# What about those RS arguments?

### The argument's assumptions are not essential to Bayes factors.

* Assume fixed designs
  * Won't reveal when BFs appear to "overstate" evidence
* Point null
  * Reduces evidence for the alternative
* Two-sided priors
  * Reduce evidence for the alternative

---

# Arguments for Bayes factors undercut RS!

<br/>

* "Dependence on the prior is good flexibility, not a problem"
  * Then there is no *general* calibration against p values
* "Multiple testing can be handled with prior odds"
  * But it is always argued that the prior odds don't matter
* "p values are sensitive to optional stopping, BFs aren't"
  * Then the argument should reflect that!

---

# Wrap-up

* Bayes factors: statistical evidence, from a Bayesian perspective
* Bayesian arguments cut two ways:
  * Flexibility: no general `\(p\)`/BF calibration
  * Insensitivity: BFs can "overstate" relative to `\(p\)`

<br/><br/>

> "I do deal with likelihood function and, occasionally, calculate the maximum likelihood estimators. However, I do so not as a matter of principle, but only in those cases when the frequency properties of the estimators fit my purposes." — Neyman, 1977

---

# My own thoughts

* Frequentist/Bayesian are useless labels
  * More variance within than between!
* A good statistician's intuitions are both "Bayesian" and "frequentist"
  * How often could I be misled in this idealized situation?
  * How would idealized beliefs change?
* Reducing crime by making it legal is a questionable approach.