The idea is that instead of looking at the probability of the data given a model, you optimise the probability of the model given the data.

Bayesian statistics is a bit more than "model selection", I would say.

Bayes Rule:
At some point in the first half of the 18th century a Presbyterian minister, Thomas Bayes, was attempting to formulate a statistical justification for the existence of God. To my knowledge he did not conclusively do this. However, in the process he stumbled across a simple recombination of the definition of conditional probability,
P(A,B) = P(A|B)P(B)
hence we can write,
P(A|B)P(B) = P(B|A)P(A)
since the joint probability is 'symmetric' in the events A and B. From this it follows that,
P(A|B) = P(B|A)P(A)/P(B)   (Bayes Rule)
While simple, this is a really cool formula. Here is why: a common mistake that people make in reasoning about probability is to assume that P(A|B)=P(B|A), i.e. that the probability of event A conditional on event B is equal to the probability of event B conditional on event A. Clearly this is a special case of Bayes rule (when P(A)=P(B)) but it is not true in general. So Bayes rule illustrates how to do 'inverse' probability correctly - you have to pay attention to the 'marginals', P(A) and P(B)!

Bayesian Statistics:
So far so simple, but wait. This innocent looking formula permits a volte-face in the concept of probability that has significant consequences. Bayes rule is more than just a formula allowing us to do problems of inverse probability correctly (e.g. Google the 'Monty Hall' problem if you don't know it already); it also allows us to reason about how to combine existing (or 'prior') knowledge with new observations, beliefs or models. It is this process to which the label 'Bayesian statistics' typically applies. In this case P(A) is called the 'prior', since it encodes a previous belief about the possible outcomes in the 'posterior', P(A|B). The denominator P(B) is known as the 'evidence'. The conditional probability on the RHS, P(B|A), is known as the 'likelihood'.
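To see how far apart P(A|B) and P(B|A) can be, here is a minimal Python sketch of Bayes rule applied to a made-up screening test. All of the numbers and names (p_a, p_b_given_a, p_b_given_not_a, bayes_posterior) are assumptions chosen purely for illustration; they do not come from any real data.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Bayes rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
p_a = 0.01              # prior P(A): the condition is present
p_b_given_a = 0.95      # likelihood P(B|A): positive test given the condition
p_b_given_not_a = 0.05  # P(B|not A): positive test without the condition

# Evidence P(B), obtained by summing the joint over A and not-A.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Posterior P(A|B): the condition is present given a positive test.
p_a_given_b = bayes_posterior(p_b_given_a, p_a, p_b)

print(f"P(B|A) = {p_b_given_a:.3f}")
print(f"P(A|B) = {p_a_given_b:.3f}")
```

With these assumed numbers P(B|A) is 0.95 while P(A|B) comes out at roughly 0.16, because the marginals P(A) and P(B) are very different.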
The likelihood can be found explicitly from data (empirical likelihood) or, as is often the case in Bayesian methods, it can encapsulate a model of the relationship between B and A (especially for high dimensional or difficult to sample data).

A very important application of Bayesian statistics is in model fitting, see (1,2,3). Suppose that we have some data (D) and a model (M). We want to know the probability that the model is correct, i.e. the probability of the model given the data, P(M|D). Intuitively this is not such an obvious thing to compute. However, possibly given some assumptions, we can usually quite easily fix P(D) and P(M) (and when comparing models on the same data, P(D) is a constant that does not affect which model is most probable). With complete knowledge of our model it is also usually easy to figure out P(D|M), i.e. the probability that this model would generate the data. It is now a straightforward application of Bayes rule to find our desired posterior,
P(M|D) = P(D|M)P(M)/P(D)
Note that we can maximise P(D|M) directly (maximum likelihood), which is normally a good strategy if there is plenty of data. On the other hand, maximising P(M|D) is a better idea if there is less data but the prior is informative and reasonable/constrained (3).

Philosophically, Bayesian statistics permits the combination of 'probabilities' with 'beliefs'. Whereas probability is a measure derived from a tally of occurrences or non-occurrences of events in the limit of infinite trials on an underlying sigma-algebra, beliefs and priors can be arrived at through a more relaxed procedure - such as a model or an estimate, and not strictly an infinite limit of trials. Practically speaking this means that knowledge about prior factors can be combined with empirical or theoretical probabilities within a principled framework. Bayesian statistics therefore stands in contrast to frequentist statistics, in which frequencies of events - and not anything else - determine probabilities.
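To make the maximum likelihood versus maximum posterior comparison from the model fitting discussion concrete, here is a small Python sketch for a coin with unknown bias. The setup is an assumption made for illustration: the 'model' is the bias of the coin, the likelihood is Bernoulli, and the prior is a Beta(a, b) distribution with a = b = 5 (a belief that the coin is roughly fair). The function names and the toss counts are hypothetical. The closed forms used are the standard ones for this likelihood/prior pair and hold for a, b > 1.

```python
def ml_estimate(heads, tosses):
    """Bias that maximises the likelihood P(D|M) for Bernoulli tosses."""
    return heads / tosses

def map_estimate(heads, tosses, a, b):
    """Bias that maximises the posterior P(M|D) under a Beta(a, b) prior.

    The posterior is Beta(heads + a, tosses - heads + b); this is its mode.
    """
    return (heads + a - 1) / (tosses + a + b - 2)

# Assumed prior: Beta(5, 5), i.e. a belief that the coin is roughly fair.
a, b = 5.0, 5.0

# Hypothetical data sets: very little data, then plenty of data.
for heads, tosses in [(3, 3), (700, 1000)]:
    ml = ml_estimate(heads, tosses)
    mp = map_estimate(heads, tosses, a, b)
    print(f"{heads}/{tosses} heads: ML = {ml:.3f}, MAP = {mp:.3f}")
```

With only three tosses, maximum likelihood concludes that the coin always lands heads, while the prior pulls the maximum posterior estimate back towards a fair coin. With a thousand tosses the two estimates nearly coincide, since the data dominate the prior.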
Importantly, the Bayesian system of probabilistic reasoning is identical to standard logic in the limit of certainties (for example when P(B|A)=1), see (3,4).

(1) Chris Bishop, Pattern Recognition and Machine Learning (http://research.microsoft.com/en-us/um/ ... /index.htm)
(2) David MacKay, Information Theory, Inference and Learning Algorithms
(3) David Barber, Bayesian Reasoning and Machine Learning
(4) R.T. Cox (1946), Probability, Frequency, and Reasonable Expectation, American Journal of Physics, 14, 1-13

Last edited by neuroguy on May 3rd, 2012, 10:00 pm, edited 1 time in total.

I would add the very important and complementary references below:
1. Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer.
2. Hastie, T., Tibshirani, R., & Friedman, J. H. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.