class: left, bottom, my-title, title-slide .title[ # Bayes factors, p values, and the replication crisis ] .author[ ### Richard D. Morey ] .date[ ### 22 September 2022 ] --- <!--script src="https://hypothes.is/embed.js" async></script--> <style type="text/css"> .faded{ opacity: .3; } .red{ color: red; } .paperimg{ width: 90%; border-top: 1px solid black; border-left: 1px solid black; border-right: 1px solid black; } </style> ## Roadmap * Demonstrate what Bayes factors are * Show Bayes factors can show strong evidence for false (null) hypotheses * Show Bayes factors show strong evidence for the null, even for suspicious ("too good") null results ### Data setup * Two groups of size N * Normal populations * Equal standard deviations * Effect size: `\(\delta\)` (diff. in means, in std. dev. units) * Evidence: Student's `\(t\)` statistic --- ## Bayes factors in use .pull-left[ <img src='img/Rouder_etal_2009.png' class='paperimg'/><br/> Paper + software: over 4500 citations ] .pull-right[ <img src='img/bayesfactor.png' style='width:40%;float:left;'/> <img src='img/spss.png' style='width:40%;float:right;'/> <div style='height: 50px;'></div> <img src='img/jasp.svg' style='width:40%;float:left;'/> <img src='img/jamovi.svg' style='width:40%;float:right;'/> ] .footnote[ * Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t-tests for accepting and rejecting the null hypothesis. *Psychonomic Bulletin and Review, 16, 225–237*. * Morey, R. D., Rouder, J. N., Jamil, T., Urbanek, S., Forner, K., & Ly, A. (2022). BayesFactor: Computation of Bayes Factors for Common Designs (0.9.12-4.4).
https://CRAN.R-project.org/package=BayesFactor ] --- count: false ## A start: Likelihood ratios <img src="index_files/figure-html/tdist1_user_01_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## A start: Likelihood ratios <img src="index_files/figure-html/tdist1_user_02_output-1.svg" width="80%" style="display: block; margin: auto;" /> <style> .panel1-tdist1-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-tdist1-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-tdist1-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false ## Effect of choosing different alternatives <img src="index_files/figure-html/lr1_user_01_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Effect of choosing different alternatives <img src="index_files/figure-html/lr1_user_02_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Effect of choosing different alternatives <img src="index_files/figure-html/lr1_user_03_output-1.svg" width="80%" style="display: block; margin: auto;" /> <style> .panel1-lr1-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-lr1-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-lr1-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## Likelihood ratios .pull-left[ ### Frequentist * Use likelihood ratios as *test statistics* * Justified through Neyman-Pearson lemma (1933) and generalizations * Interpretation is through *error rates* of tests ] .pull-right[ ### Bayesian * Use likelihood ratios as measure of evidence * Justified through Bayes' theorem * Interpretation is direct ] --- ## Likelihood ratio to Bayes factor Probability of data is taken 
as an *average of the data model `\(p\)` over the prior `\(\pi\)`*: `$$p_{\cal M}(\boldsymbol y) = \int_\Theta p_{\cal M}(\boldsymbol y\mid \boldsymbol\theta)\,\pi_{\cal M}(\boldsymbol\theta)\, d\boldsymbol\theta$$` `$$BF_{10} = \frac{p_{{\cal M}_1}(\boldsymbol y)}{p_{{\cal M}_0}(\boldsymbol y)}$$` Justified through Bayes' theorem: `$$\frac{p({\cal M}_1\mid\boldsymbol y)}{p({\cal M}_0\mid\boldsymbol y)} = \frac{p_{{\cal M}_1}(\boldsymbol y)}{p_{{\cal M}_0}(\boldsymbol y)}\times\frac{p({\cal M}_1)}{p({\cal M}_0)}$$` --- ## Why are Bayes factors advocated? * Direct interpretation as statistical evidence: convincingness * Ability to show evidence for point nulls: evidence for regularity * Applicability beyond nested models * Model selection consistency in the large-sample limit * Perception that *p* values "overstate" evidence * Insensitivity to stopping rules Bayes factors have been suggested as a replacement for *p* values. --- ## Two main properties of Bayes factors .pull-left[ ### Marginal Bayes factors compute the probability of the data *by averaging over the prior*. ] .pull-right[ ### Comparative Bayes factors compare *two models* at a time. ] <br/> .pull-left[ **Criticism**: Averaging over a prior can lead to high error rates for some elements of a hypothesis. Lack of *severity* (Mayo, 2018): a high probability of claiming evidence for a false hypothesis. ] .pull-right[ **Criticism**: Both models may be obviously terrible. High probability under one hypothesis may be suspicious (too good), but BFs offer no account of this. ] --- ### "...but what's a good Bayes factor?"
Common interpretation of the magnitude of Bayes factors, for the numerator model ℳ<sub>1</sub>:

<table>
  <thead>
    <tr><th>BF range</th><th>Interpretation</th></tr>
  </thead>
  <tbody>
    <tr><td>0 – 1/3.16</td><td>(Supports ℳ<sub>0</sub>)</td></tr>
    <tr><td>1/3.16 – 1</td><td>Barely worth mentioning (for ℳ<sub>0</sub>)</td></tr>
    <tr><td>1 – 3.16</td><td>Barely worth mentioning</td></tr>
    <tr><td>3.16 – 10</td><td>Substantial<sup>1</sup></td></tr>
    <tr><td>10 – 31.62</td><td>Strong</td></tr>
    <tr><td>31.62 – 100</td><td>Very strong</td></tr>
    <tr><td>100 – ∞</td><td>Decisive</td></tr>
  </tbody>
</table>

<sup>1</sup> A Bayes factor of about 3 is often claimed to be calibrated to <i>p = 0.05</i>. See Benjamin et al. (2018), "Redefine statistical significance."
.footnote[ Jeffreys, H. (1961). *Theory of probability (3rd edition).* Oxford University Press. ] --- count: false ## Data predictions: p(y|δ) <img src="index_files/figure-html/datapred1_user_01_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Data predictions: p(y|δ) <img src="index_files/figure-html/datapred1_user_02_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Data predictions: p(y|δ) <img src="index_files/figure-html/datapred1_user_03_output-1.svg" width="80%" style="display: block; margin: auto;" /> <style> .panel1-datapred1-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-datapred1-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-datapred1-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false ## Prior distributions: 𝜋(δ) <img src="index_files/figure-html/prior1_user_01_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Prior distributions: 𝜋(δ) <img src="index_files/figure-html/prior1_user_02_output-1.svg" width="80%" style="display: block; margin: auto;" /> <style> .panel1-prior1-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-prior1-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-prior1-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## Weighted predictions: p(y|δ)𝜋(δ) <img src="index_files/figure-html/unnamed-chunk-13-1.svg" width="80%" style="display: block; margin: auto;" /> --- ## One alternative prior predictive <img src="index_files/figure-html/unnamed-chunk-14-1.svg" width="80%" style="display: block; margin: auto;" /> --- ## How N affects the predictions <img 
src="index_files/figure-html/unnamed-chunk-15-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Some possible priors <img src="index_files/figure-html/allpriors_user_01_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Some possible priors <img src="index_files/figure-html/allpriors_user_02_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Some possible priors <img src="index_files/figure-html/allpriors_user_03_output-1.svg" width="80%" style="display: block; margin: auto;" /> <style> .panel1-allpriors-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-allpriors-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-allpriors-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false ## Null and alternative prior predictives, 𝜋(y) <img src="index_files/figure-html/priorpred1_user_01_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Null and alternative prior predictives, 𝜋(y) <img src="index_files/figure-html/priorpred1_user_02_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Null and alternative prior predictives, 𝜋(y) <img src="index_files/figure-html/priorpred1_user_03_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Null and alternative prior predictives, 𝜋(y) <img src="index_files/figure-html/priorpred1_user_04_output-1.svg" width="80%" style="display: block; margin: auto;" /> <style> .panel1-priorpred1-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-priorpred1-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-priorpred1-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 
1%; font-size: 80% } </style> --- ## New data setup * Four groups of equal size `\(N\)` * Effect size: `\(\omega^2\)`, "proportion of non-error variance" * Evidence: `\(F\)` statistic (one-way ANOVA) * Otherwise, the same as previous setup <center> <img src='img/Rouder_etal_2012.png' class='paperimg' style='width:40%;'/> </center> .footnote[ Rouder, J. N., Morey, R. D., Speckman, P. L., & Province, J. M. (2012). Default Bayes factors for ANOVA designs. *Journal of Mathematical Psychology, 56, 356–374.* ] --- ## Effect size ω² <img src="index_files/figure-html/unnamed-chunk-16-1.svg" width="100%" style="display: block; margin: auto;" /> --- count: false ## Evidence for the null <img src="index_files/figure-html/evfromt_user_01_output-1.svg" width="100%" style="display: block; margin: auto;" /> --- count: false ## Evidence for the null <img src="index_files/figure-html/evfromt_user_02_output-1.svg" width="100%" style="display: block; margin: auto;" /> --- count: false ## Evidence for the null <img src="index_files/figure-html/evfromt_user_03_output-1.svg" width="100%" style="display: block; margin: auto;" /> --- count: false ## Evidence for the null <img src="index_files/figure-html/evfromt_user_04_output-1.svg" width="100%" style="display: block; margin: auto;" /> <style> .panel1-evfromt-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-evfromt-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-evfromt-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- count: false ## Error rates 'accepting' the null <img src="index_files/figure-html/errrates1_user_01_output-1.svg" width="100%" style="display: block; margin: auto;" /> --- count: false ## Error rates 'accepting' the null <img src="index_files/figure-html/errrates1_user_02_output-1.svg" width="100%" style="display: block; margin: auto;" /> --- count: false ## Error 
rates 'accepting' the null <img src="index_files/figure-html/errrates1_user_03_output-1.svg" width="100%" style="display: block; margin: auto;" /> --- count: false ## Error rates 'accepting' the null <img src="index_files/figure-html/errrates1_user_04_output-1.svg" width="100%" style="display: block; margin: auto;" /> <style> .panel1-errrates1-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-errrates1-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-errrates1-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## The result of marginalization * Probability of data is *averaged over* large effect sizes * Small observed effect sizes "look" null .pull-left[ ## Frequentist Protect from these high error rates by maximizing, not averaging. `$$p = \max_{\theta\in\Theta_0} Pr(\text{Extreme test statistic};\theta)$$` *Guarantees* error rates are controlled for *every* parameter value, not just on average. ] .pull-right[ ## Bayesian * Defend need for priors * Defend conservatism in general (this kind of error is "not so bad") * Appeal to parsimony ] --- ## "Too good to be true" nulls? When results are *too* null, people start to get suspicious. > "If it were just a question of having hit the bull's eye with a single shot we might conclude...that Mendel was simply lucky, but when a whole succession of shots comes close to the bull's eye we are entitled to invoke skill or some other factor." (Edwards, 1986, p. 303) <br/> > "Our result suggests a level of linearity that is extremely unlikely to have arisen from standard sampling under the null hypothesis of linearity. Any actual deviation from perfect linearity would even lower these probabilities." (Fraud complaint against J. Förster, 2012, p. 23) --- ### Why be suspicious? 
.pull-left[ ### Frequentist It is very unlikely to see test statistics so close to the null, particularly over and over (significance testing logic). ] .pull-right[ ### Likelihood/Bayes > "[A]ny *pattern* which we recognize in some data and which is unexplained on the current hypothesis is a signal that we should seek an alternative hypothesis, because an alternative which accounts for it is almost bound to have a higher likelihood." (Edwards, 1986, p. 303). But why were we suspicious in the first place...? ] --- count: false ## Evidence from small F statistics <img src="index_files/figure-html/smallt1_user_01_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Evidence from small F statistics <img src="index_files/figure-html/smallt1_user_02_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Evidence from small F statistics <img src="index_files/figure-html/smallt1_user_03_output-1.svg" width="80%" style="display: block; margin: auto;" /> --- count: false ## Evidence from small F statistics <img src="index_files/figure-html/smallt1_user_04_output-1.svg" width="80%" style="display: block; margin: auto;" /> <style> .panel1-smallt1-user { color: black; width: 99%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-smallt1-user { color: black; width: NA%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-smallt1-user { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ## Conclusions Bayesian *marginalization* and *comparative evidence* yield radically different results from significance testing logic. * Severity is lost: we can claim "strong" evidence `\((BF>10)\)` for the null when it is false * Flips Benjamin et al (2018)'s argument on its head: *Bayes factors* can make it easy to find misleadingly strong evidence. 
* Suspicious results are considered strong evidence for the null (in spite of their unexpectedness under the null) * Losing an intuitive forensic check is *bad* in a replication crisis. If we **replace** *p* values with Bayes factors, we lose important frequentist checks on Bayes factors!