The principle of Occam's Razor is very subtle and has been interpreted in several different ways. Here we discuss the main interpretations we know of and highlight the differences between them. At the end, we describe two distinct underlying principles that we believe are at the heart of these various interpretations.
1. When deciding between two models which make equivalent predictions, choose the simpler one.
This rule tells us which underlying model to favor when more than one will produce the results we want. (If you're not clear on the difference between a model and a classifier, see the section on philosophical foundations.)
The idea here is best seen in a historical example. Galileo showed that gravity pulls objects downward with a constant acceleration. Kepler separately observed that the planets follow nearly elliptical orbits around the sun. Then Newton showed that both of these laws could be derived as corollaries from his universal law of gravitation. Once this was shown, it was preferable to adopt Newton's unified theory, rather than the combination of Galileo's and Kepler's. The motivation for this preference comes from interpretation 1 above: better to assume one model which explains both observations than a separate model for each phenomenon. (In fact, Newton's theory turned out to match the astronomical data better than Kepler's, but it would have been preferable even if it had only been equivalent.)
In other words, this rule tells us to assume only what we need to assume for our observations to make sense. Once Newton's assumptions are shown to account for both astronomy and mechanics, Kepler's and Galileo's assumptions become redundant and, the principle tells us, should be discarded. Atheists often use a similar argument to undermine theism. God was as good an explanation for man's existence as any, they say, until Darwin showed that the fundamental laws of life could account for our evolution. This does not disprove the existence of God, but to the atheist it goes a long way towards making the assumption of His existence redundant, and consequently - by Occam's Razor - expendable.
2. If two decision rules classify the existing data equally well, the simpler one is more likely to classify future data correctly.
3. Given a simple decision rule A, and a much more complex rule B which classifies the existing data only slightly better than A, A is likely to classify future data better than B.
Both of these interpretations are really just special cases of a more primitive claim, which holds that when assessing how likely different classifiers are to succeed on future data, we should consider not only their performance on existing data but also their relative simplicity. In other words:
4. Simpler classifiers are more likely to be correct.
Consider the example of the three chili/tomato classifiers from the introduction. In any typical circumstance we prefer the simple quadratic curve (B) over the complicated curve (C) - that is, we think B is more likely to correctly classify future data than C. This is because we believe the smooth shape of B probably reflects some reliable property of the data, whereas the squiggles of C depend only on random noise in the existing sample. But where does this belief come from?
Interpretation 4 above is the answer. If simple classifiers were no more likely to be correct than complicated ones, we would prefer C over B, since C agrees better with the available data. It is only the bias toward simplicity that motivates us to choose B.
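The trade-off between fit to the existing sample and performance on future data can be made concrete with a toy sketch. Everything in it is an assumption made for illustration and does not come from the chili/tomato example: the one-dimensional data, the "true" threshold at 0.5, the two flipped training labels standing in for noise, and the names simple_rule and complex_rule. Because the data are constructed so that the simple rule is the true one, the sketch illustrates the Rule of Simplicity without justifying it.

```python
# Hypothetical 1-D data set, invented for illustration (not from the text).
# The "true" label is 1 when x > 0.5. Two training labels (at x = 0.25 and
# x = 0.75) are deliberately flipped to stand in for random noise.
TRAIN = [
    (0.05, 0), (0.15, 0), (0.25, 1), (0.35, 0), (0.45, 0),
    (0.55, 1), (0.65, 1), (0.75, 0), (0.85, 1), (0.95, 1),
]

# Fresh points, labelled by the true rule with no noise.
TEST = [
    (0.08, 0), (0.18, 0), (0.27, 0), (0.38, 0), (0.47, 0),
    (0.53, 1), (0.62, 1), (0.73, 1), (0.82, 1), (0.93, 1),
]


def simple_rule(x):
    """Simple classifier: a single threshold at 0.5."""
    return 1 if x > 0.5 else 0


def complex_rule(x):
    """Complex classifier: copy the label of the nearest training point.

    This memorizes TRAIN exactly, including its two noisy labels.
    """
    _, nearest_label = min(TRAIN, key=lambda point: abs(point[0] - x))
    return nearest_label


def accuracy(rule, data):
    """Fraction of (x, label) pairs in data that the rule classifies correctly."""
    correct = sum(1 for x, label in data if rule(x) == label)
    return correct / len(data)


for name, rule in [("simple", simple_rule), ("complex", complex_rule)]:
    print(name, "train:", accuracy(rule, TRAIN), "test:", accuracy(rule, TEST))
```

In this constructed case the memorizing rule agrees perfectly with the existing sample but misclassifies the fresh points that fall next to the two noisy labels, while the threshold rule misses the noisy training points and classifies every fresh point correctly. Whether real data behave this way is exactly what the Rule of Simplicity asserts.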
The Two Underlying Principles
We claim that all standard interpretations of Occam's Razor derive their force from one or both of two separate principles, exemplified by 1 and 4 above.
The first principle (1) states that one should favor the simplest model which explains the observations. So, adopt the minimal assumptions of Newton, rather than the combined assumptions of Galileo and Kepler. We will refer to this principle as Occam's Razor proper, since it is closer to the principle formulated and applied by William of Ockham. Note that this is a claim about what we should believe - an epistemological principle.
The second principle (4) states that simpler classifiers are more likely to be correct. So, all else being equal, we favor a simpler classifier over a more complicated one. We even prefer the simpler classifier B over classifier C, despite a small decrease in agreement with the training set. This is essentially a metaphysical claim, positing a general (probabilistic) property about the world, and we will refer to it as the Rule of Simplicity to distinguish it from Ockham's more epistemological principle.
The relationship between these principles is nuanced and deserves treatment at greater length than we can afford here. Consider the Rule of Simplicity. Is it a contingent property of the world, or is it in some sense a necessary property of any imaginable world? Can we imagine a world in which the Rule of Simplicity does not apply? Or alternatively, might the Rule in some way be an a priori consequence of Occam's Razor proper? These are profound and difficult questions, and they deserve serious attention from the scientists and philosophers whose work they underlie.