Bayesian Inference, Bayes Factor, Model Selection
Which model is statistically preferred by the data and by how much?

Bayesian Inference
A primary aim of modern Bayesian inference is to construct a posterior distribution $p(\theta \mid d)$: the probability density of the model parameters $\theta$ given the data $d$.
According to Bayes' theorem, the posterior distribution for multimessenger astrophysics is given by
$$p(\theta \mid d) = \frac{\mathcal{L}(d \mid \theta)\,\pi(\theta)}{\mathcal{Z}},$$
where $\mathcal{L}(d \mid \theta)$ is the likelihood, $\pi(\theta)$ is the prior, and $\mathcal{Z}$ is the evidence (the normalisation constant).
The likelihood function is something that we choose. It is a description of the measurement: by writing down a likelihood, we implicitly introduce a noise model. For GW astronomy, we typically assume a Gaussian-noise likelihood function that looks something like this:
$$\mathcal{L}(d \mid \theta) \propto \exp\!\left(-\frac{1}{2}\,\big\langle d - h(\theta)\,\big|\,d - h(\theta)\big\rangle\right),$$
where $h(\theta)$ is the waveform template for parameters $\theta$ and $\langle\cdot|\cdot\rangle$ is the noise-weighted inner product defined by the detector noise power spectral density.
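As a toy illustration (not the full frequency-domain GW likelihood, which uses the noise-weighted inner product above), here is a minimal sketch of a Gaussian-noise log-likelihood for whitened, discretely sampled data; the sinusoidal "waveform" and all numbers are made up:

```python
import numpy as np

def gaussian_log_likelihood(data, template, sigma):
    """ln L(d | theta) for stationary Gaussian noise with per-sample
    standard deviation sigma (a whitened, time-domain stand-in for the
    noise-weighted inner product used in GW analyses)."""
    residual = data - template
    return -0.5 * np.sum(residual**2 / sigma**2 + np.log(2 * np.pi * sigma**2))

# Quick check on simulated data: the likelihood peaks near the true template.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512)
signal = 1.0 * np.sin(2 * np.pi * 5 * t)            # stand-in "waveform" h(theta)
data = signal + rng.normal(0, 0.3, t.size)

print(gaussian_log_likelihood(data, signal, 0.3))        # near the maximum
print(gaussian_log_likelihood(data, 0.5 * signal, 0.3))  # poorer fit, lower lnL
```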
In practical terms, the evidence is a single number. It usually does not mean anything by itself, but becomes useful when we compare one evidence with another. Formally, the evidence is a likelihood function. Specifically, it is the completely marginalised likelihood function. It is therefore sometimes denoted $\mathcal{L}(d)$:
$$\mathcal{Z} \equiv \mathcal{L}(d) = \int \mathrm{d}\theta\, \mathcal{L}(d \mid \theta)\,\pi(\theta).$$
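To make "completely marginalised likelihood" concrete, here is a minimal sketch, assuming a one-parameter toy model (a sinusoid with unknown amplitude $A$ and a flat prior on $A$; all settings are illustrative), that computes $\ln\mathcal{Z}$ by direct numerical integration on a grid:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
sigma = 0.3
data = 1.0 * np.sin(2 * np.pi * 5 * t) + rng.normal(0, sigma, t.size)

def log_likelihood(A):
    """Gaussian-noise log likelihood for amplitude A."""
    residual = data - A * np.sin(2 * np.pi * 5 * t)
    return -0.5 * np.sum(residual**2 / sigma**2 + np.log(2 * np.pi * sigma**2))

A = np.linspace(0, 2, 2001)                     # support of the flat prior pi(A)
log_L = np.array([log_likelihood(Ai) for Ai in A])

# ln Z = ln ∫ L(d|A) pi(A) dA, approximated as a Riemann sum on the grid
log_Z = logsumexp(log_L) + np.log(A[1] - A[0]) - np.log(A[-1] - A[0])
print(f"ln Z = {log_Z:.1f}")
```

The single number printed at the end is the evidence for this toy model; it only becomes meaningful when compared with the evidence of an alternative model for the same data.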
Bayes Factor
Recall from probability theory that, for a model $M_i$ and data $d$, Bayes' theorem reads
$$p(M_i \mid d) = \frac{p(d \mid M_i)\,p(M_i)}{p(d)}.$$
We wish to distinguish between two hypotheses (models), $M_1$ and $M_2$.
Bayes' theorem can be expressed in a form more convenient for our purposes by employing the completeness relation $\sum_i p(M_i \mid d) = 1$ (the set of models under consideration is assumed to be exhaustive), so that the normalisation can be written as $p(d) = \sum_i p(d \mid M_i)\,p(M_i)$.
The ratio of the evidences for two different models is called the Bayes factor. For example, we can compare the evidence for a BBH waveform predicted by general relativity (model $M_\mathrm{GR}$) with the evidence for an alternative model, such as a waveform with beyond-GR deviations (model $M_{\lnot\mathrm{GR}}$):
$$\mathrm{BF} = \frac{\mathcal{Z}_\mathrm{GR}}{\mathcal{Z}_{\lnot\mathrm{GR}}}.$$
Formally, the correct metric with which to compare two models is not the Bayes factor but the odds ratio. The odds ratio is the product of the Bayes factor with the prior odds:
$$\mathcal{O}_{12} = \frac{\pi(M_1)}{\pi(M_2)}\,\frac{\mathcal{Z}_1}{\mathcal{Z}_2} = \frac{\pi(M_1)}{\pi(M_2)}\,\mathrm{BF}_{12}.$$
In many practical applications, we set the prior odds to unity, and so the odds ratio is simply the Bayes factor. This is sensible whenever our intuition tells us that, before we make the measurement, both hypotheses are equally likely.
There are some (fairly uncommon) examples where we might choose a different prior odds ratio. For example, we may construct a model in which general relativity (GR) is wrong. We may further suppose that there are multiple different ways in which it could be wrong, each corresponding to a different GR-is-wrong sub-hypothesis. If we calculated the odds ratio comparing one of these GR-is-wrong sub-hypotheses to the GR-is-right hypothesis, we would not assign equal prior odds to both hypotheses. Rather, we would assign at most 50% probability to the entire GR-is-wrong hypothesis, which would then have to be split among the various sub-hypotheses.
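As a small, purely illustrative sketch of this bookkeeping (the number of sub-hypotheses and the Bayes factor below are made up), the prior odds for one GR-is-wrong sub-hypothesis versus GR can be folded into the odds ratio like this:

```python
import numpy as np

# Hypothetical bookkeeping: the GR-is-wrong hypothesis is split into n_sub
# sub-hypotheses that together share at most 50% of the prior probability.
n_sub = 10
p_gr = 0.5                        # prior probability that GR is right
p_sub = (1.0 - p_gr) / n_sub      # prior probability of one GR-is-wrong sub-hypothesis

prior_odds = p_sub / p_gr         # prior odds of this sub-hypothesis versus GR (= 0.1)
ln_bf = 2.0                       # hypothetical log Bayes factor from the data
ln_odds = ln_bf + np.log(prior_odds)
print(f"prior odds = {prior_odds:.2f}, ln odds = {ln_odds:.2f}")
```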
Model Selection
Bayesian evidence encodes two pieces of information:
- The likelihood tells us how well our model fits the data.
- The act of marginalisation tells us about the volume of parameter space used to carry out the fit.
This creates a sort of tension:
We want to get the best fit possible (high likelihood) but with a minimum prior volume.
A model with a decent fit and a small prior volume often yields a greater evidence than a model with an excellent fit and a huge prior volume.
In these cases, the Bayes factor penalises the more complicated model for being too complicated.
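The following toy sketch (made-up data, priors, and grid resolutions) illustrates this penalty: a straight-line model with a modest prior volume is compared against a quadratic model whose extra coefficient has a very wide flat prior, with both evidences computed by brute-force grid integration:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
sigma = 0.5
data = 1.0 * x + rng.normal(0, sigma, x.size)        # truth: a straight line

def log_like(model):
    """Gaussian-noise log likelihood, summed over data points (last axis)."""
    return -0.5 * np.sum((data - model) ** 2 / sigma**2
                         + np.log(2 * np.pi * sigma**2), axis=-1)

# Model 1: y = a x, flat prior a in [-2, 2]
a = np.linspace(-2, 2, 401)
logZ1 = logsumexp(log_like(a[:, None] * x)) + np.log(a[1] - a[0]) - np.log(4.0)

# Model 2: y = a x + b x^2, flat priors a in [-2, 2] and b in [-50, 50]
b = np.linspace(-50, 50, 2001)
logL2 = np.array([log_like(ai * x + b[:, None] * x**2) for ai in a])
logZ2 = (logsumexp(logL2) + np.log(a[1] - a[0]) + np.log(b[1] - b[0])
         - np.log(4.0) - np.log(100.0))

# The quadratic fits at least as well at its best point, but its huge prior
# volume is penalised; with these settings ln Z_1 typically exceeds ln Z_2.
print(f"ln Z_1 = {logZ1:.1f}, ln Z_2 = {logZ2:.1f}, ln BF_12 = {logZ1 - logZ2:.1f}")
```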
We can obtain some insights into the model evidence by making a simple approximation to the integral over parameters:
- Consider first the case of a model having a single parameter
. ( ) - Assume that the posterior distribution is sharply peaked around the most probable value
, with width , then we can approximate the in- tegral by the value of the integrand at its maximum times the width of the peak. - Assume that the prior is flat with width
so that .

then we have
- The first term represents the fit to the data given by the most probable parameter value; for a flat prior this is just the (maximum) log likelihood.
- The second term (also called the Occam factor) penalises the model according to its complexity. Because $\Delta w_\mathrm{posterior} < \Delta w_\mathrm{prior}$, this term is negative, and it increases in magnitude as the ratio $\Delta w_\mathrm{posterior}/\Delta w_\mathrm{prior}$ gets smaller. Thus, if parameters are finely tuned to the data in the posterior distribution, the penalty term is large.
For a model with $M$ parameters, each assumed to have the same ratio of posterior-to-prior width, the corresponding approximation is
$$\ln \mathcal{Z} \simeq \ln \mathcal{L}(d \mid \mathbf{w}_\mathrm{MAP}) + M \ln\!\left(\frac{\Delta w_\mathrm{posterior}}{\Delta w_\mathrm{prior}}\right).$$
Thus, in this very simple approximation, the size of the complexity penalty increases linearly with the number $M$ of adaptive parameters in the model.
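A quick numerical sanity check of this "peak value times width" approximation, assuming a single parameter with a Gaussian-shaped likelihood peak of width $s$ and a flat prior of width 2 (here $\Delta w_\mathrm{posterior}$ is taken to be the effective width $\sqrt{2\pi}\,s$ of the peak; all numbers are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

w_map, s = 0.3, 0.05                                 # peak location and width
prior_lo, prior_hi = -1.0, 1.0                       # flat prior; width = 2
log_L = lambda w: -0.5 * (w - w_map) ** 2 / s**2     # log likelihood, peak value 0

# Direct numerical marginalisation over the prior range
w = np.linspace(prior_lo, prior_hi, 20001)
log_Z_numeric = (logsumexp(log_L(w)) + np.log(w[1] - w[0])
                 - np.log(prior_hi - prior_lo))

# Peak value times (posterior width / prior width)
dw_post = np.sqrt(2 * np.pi) * s
log_Z_approx = log_L(w_map) + np.log(dw_post / (prior_hi - prior_lo))
print(f"numerical: {log_Z_numeric:.3f}   peak-times-width: {log_Z_approx:.3f}")
```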
A further insight into Bayesian model comparison, and into how the marginal likelihood can favour models of intermediate complexity, can be gained by considering the Figure below.

More insights:
- If we compare two models where one model is a superset of the other (for example, we might compare GR and GR with non-tensor modes), and if the data are better explained by the simpler model, the log Bayes factor in favour of the simpler model is typically modest. Thus, it is difficult to completely rule out extensions to existing theories; we just obtain ever tighter constraints on the extended parameter space.
- To make good use of Bayesian model comparison, we must fully specify priors that are independent of the current data.
- The sensitivity of the marginal likelihood to the prior range depends on the shape of the prior and is much greater for a uniform prior than for a scale-invariant prior (see e.g. Gregory 2005b, p. 61).
- In most instances we are not particularly interested in the Occam factor itself, but only in the relative probabilities of the competing models as expressed by the Bayes factors. Because the Occam factor arises automatically in the marginalisation procedure, its effects will be present in any model-comparison calculation.
- No Occam factors arise in parameter-estimation problems. Parameter estimation can be viewed as model comparison where the competing models have the same complexity, so the Occam penalties are identical and cancel out.
- On average, the Bayes factor will always favour the correct model. To see this, consider two models $M_1$ and $M_2$, in which the truth corresponds to $M_1$. We assume that the true distribution from which the data are generated is contained within the set of models under consideration. For a given finite data set, it is possible for the Bayes factor to be larger for the incorrect model. However, if we average the Bayes factor over the distribution of data sets, we obtain the expected log Bayes factor in the form
$$\int p(d \mid M_1)\,\ln\frac{p(d \mid M_1)}{p(d \mid M_2)}\,\mathrm{d}d,$$
where the average has been taken with respect to the true distribution of the data. This is an example of a Kullback–Leibler divergence and satisfies the property of always being positive unless the two distributions are equal, in which case it is zero. Thus, on average, the Bayes factor will always favour the correct model.
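A minimal Monte Carlo sketch of this statement, assuming two fully specified (parameter-free) models so that the evidence is just a density for the data; the model choices are illustrative only:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical models with no free parameters:
#   M_1 (true): d ~ N(0, 1),   M_2: d ~ N(0.5, 1)
rng = np.random.default_rng(2)
d = rng.normal(0.0, 1.0, 100_000)                   # data drawn from the true model

# ln BF_12 for each data realisation
ln_bf = norm.logpdf(d, 0.0, 1.0) - norm.logpdf(d, 0.5, 1.0)

print(f"average ln BF over data realisations: {ln_bf.mean():.3f}")
print(f"analytic KL divergence:               {0.5 * 0.5**2:.3f}")
# Individual realisations can have ln_bf < 0, but the average favours the true model.
```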
We have seen from Figure 1 that the model evidence can be sensitive to many aspects of the prior, such as the behaviour in the tails. Indeed, the evidence is not defined if the prior is improper, as can be seen by noting that an improper prior has an arbitrary scaling factor (in other words, the normalisation coefficient is not defined because the distribution cannot be normalised). If we consider a proper prior and then take a suitable limit in order to obtain an improper prior (for example, a Gaussian prior in which we take the limit of infinite variance), then the evidence will go to zero, as can be seen from Figure 1 and the equation below it. It may, however, be possible to consider the evidence ratio between two models first and then take the limit, in order to obtain a meaningful answer.
In a practical application, therefore, it will be wise to keep aside an independent test set of data on which to evaluate the overall performance of the final system.
Reference
- By referring to model parameters, we are implicitly acknowledging that we begin with some model. Some authors make this explicit by writing the posterior as $p(\theta \mid d, M)$, where $M$ is the model. (Other authors use a different symbol to denote the model.) We find this notation clunky and unnecessary, since it goes without saying that one must always assume some model. If/when we consider two distinct models, we add an additional variable to denote the model. ↩︎