8 Reichenbach's principle of the common cause
Natural selection
causes Y from the hypothesis that X and Y are joint effects of a common
cause. Still, if the argument merely pits the hypothesis that X causes Y
against the hypothesis that X and Y are causally independent, it makes
sense, and the law of likelihood explains why.28
I now want to examine a different approach to this testing problem. It
appeals to an idea that Hans Reichenbach (1956) called the principle of the
common cause. This principle says that if the variables X and Y are correlated, then either X causes Y, Y causes X, or X and Y are joint effects of a
common cause.29 These three possibilities define what it means for X and
Y to be causally connected. Reichenbach’s principle has been central to the
Bayes net literature in computer science; it is closely connected with the
causal Markov condition (see Spirtes et al. 2001 and Woodward 2003).
Although Reichenbach’s principle and the likelihood approach I have
taken may seem to be getting at the same thing, I think there is a deep
difference. In fact, if the likelihood approach is right, then Reichenbach’s
principle must be too strong. The likelihood approach does not say that X
and Y must be causally connected if they are correlated; it doesn’t even say
that they probably are. The most that the law of likelihood permits one to
conclude is that the hypothesis of causal connection is better supported by
correlational data than is the hypothesis of causal independence.
To delve deeper into the principle of the common cause, let’s begin
with an example that Reichenbach used to illustrate it. Consider an acting
troupe that travels around the country presenting plays. We follow the
company for several years, recording on each day whether the leading man
and the leading lady have upset stomachs. These data allow us to see how
frequently each of them gets sick and how frequently both of them get
sick. Suppose the following inequality is true:
(1) f(Actor 1 gets sick & Actor 2 gets sick)
> f(Actor 1 gets sick) f(Actor 2 gets sick).
28 The complaint of Leroi et al. (1994) that the comparative method does not get at the causal basis
of selection (because it fails to pry apart selection-of from selection-for, on which see Sober 1984)
needs to be understood in this light.
29 Reichenbach additionally believed that when X and Y are correlated and neither causes the other,
not only does there exist a common cause of X and Y; in addition, if all the common causes
affecting X and Y are taken into account, they will screen off X from Y, meaning that the completely
specified common causes will render X and Y conditionally probabilistically independent of each
other. Results in quantum mechanics pertaining to the Bell inequality have led many to question
this screening-off requirement (see, for example, Van Fraassen 1982).
[Figure 3.22: Observed association, f(X&Y) > f(X)f(Y) → Probabilistic correlation, Pr(X&Y) > Pr(X)Pr(Y) → Causal hypothesis: X and Y are causally connected]
Figure 3.22 Although the principle of the common cause is sometimes described as
saying that an “observed correlation” entails a causal connection, it is better to divide the
inference into two steps.
Here f(e) means the frequency of days on which the event (type) e occurs.
For example, maybe each actor gets sick once every twenty days, but the
frequency of days on which both get sick is greater than 1/400. If this
inequality is big enough and we have enough data, our observations will
license the inference that the following probabilistic inequality is true:
(2) For each day i, Pr(Actor 1 gets sick on day i & Actor 2 gets sick on day i)
> Pr(Actor 1 gets sick on day i) Pr(Actor 2 gets sick on day i).
It is important to be clear on the difference between the observed association
described in (1) and the inferred correlation stated in (2); this distinction was
discussed in §2.18 in connection with the inductive sampling formulation of
the argument from design. The association is in our data. However, we do
not observe probabilities; rather, we infer them.30 Once our frequency data
permit (2) to be inferred, the principle of the common cause kicks in,
concluding that there is a causal connection between one actor’s getting sick
on a given day and the other’s getting sick then too. Perhaps the correlation
exists because the two actors eat in the same restaurants; if one of them eats
tainted food on a given day, the other probably does too. The two-step
inference just described is depicted in Figure 3.22.
How could X and Y be associated in the data without being correlated?
Perhaps the sample size is too small. If you toss a pair of coins ten times, it
is possible that heads on one will be associated with heads on the other, in
30 In this inference from sample frequencies to probabilities, Bayesians will claim that prior
probabilities are needed while frequentists will deny that this is necessary. Set that disagreement
aside.
the sense that there is an inequality among the relevant frequencies. But
this may just be a fluke; the tosses may in fact be probabilistically independent of each other. One way to see whether this is so is to do a larger
experiment. If the association in the ten tosses is just a fluke, you expect
the association to disappear as sample size is increased.
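The contrast between a fluke association and a real correlation can be illustrated with a small simulation. The sketch below is only an illustration: the function names, the seed, and the sample sizes are assumptions made here, and the two coins are stipulated to be fair and probabilistically independent. It computes the quantity f(X&Y) − f(X)f(Y) for a short run and for a long run of paired tosses.

```python
import random

def association(pairs):
    """Return f(X & Y) - f(X) f(Y) for a list of (x, y) boolean pairs."""
    n = len(pairs)
    f_x = sum(1 for x, _ in pairs if x) / n
    f_y = sum(1 for _, y in pairs if y) / n
    f_xy = sum(1 for x, y in pairs if x and y) / n
    return f_xy - f_x * f_y

def toss_pairs(n, rng):
    """Simulate n tosses of two fair, probabilistically independent coins."""
    return [(rng.random() < 0.5, rng.random() < 0.5) for _ in range(n)]

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable
small = association(toss_pairs(10, rng))
large = association(toss_pairs(100_000, rng))

print(f"10 tosses:      f(X&Y) - f(X)f(Y) = {small:+.4f}")
print(f"100,000 tosses: f(X&Y) - f(X)f(Y) = {large:+.4f}")
```

In a run of ten tosses the difference can easily be some distance from zero in either direction; in the long run it should sit very close to zero, which is what probabilistic independence leads one to expect.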
The principle of the common cause sounds like a sensible idea when it
is considered in connection with examples like Reichenbach’s acting
troupe. But is it always true? Quantum mechanics has alerted us to the
possibility that X and Y might be correlated without being causally
connected; maybe there are stable correlations that are just brute facts.
However, it is not necessary to consider the world of micro-physics to find
problems for Reichenbach’s principle. Yule (1926) described a class of
cases in which X and Y are causally independent though probabilistically
correlated. A hypothetical example of the kind of situation he had in
mind is provided by the positive association of sea levels in Venice and
bread prices in Britain over the past 200 years (Sober 2001). Since both
have increased monotonically, higher than average values of the one are
associated with higher than average values of the other. This association is
not due to sampling error; if yearly data were supplemented with monthly
data from the same 200 years, the pattern would persist. Nor is this
problem for Reichenbach’s principle restricted to time series data. Variables can be spatially rather than temporally associated, due to two causally
independent processes each leading a variable to monotonically increase
across some stretch of terrain. Suppose that bread prices on a certain date
in the year 2008 increase along a line that runs from southeast to
northwest Europe. And suppose that songbirds on that day are larger in
the northwest than they are in the southeast. If so, higher bread prices are
spatially associated with larger songbirds. And the association is not a
fluke, in that the pattern of association persists with larger sample size.
But still, songbird size and bread prices may well be causally independent.
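A short simulation shows how two series generated by separate processes can nonetheless be strongly associated when both increase monotonically. Everything in the sketch is hypothetical: the series are made-up stand-ins for sea levels and bread prices, each built from its own independent random number stream, and the step sizes are arbitrary.

```python
import random

def monotone_series(n, max_step, rng):
    """Cumulative sum of independent non-negative increments: it never decreases."""
    level, out = 0.0, []
    for _ in range(n):
        level += rng.uniform(0.0, max_step)
        out.append(level)
    return out

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Two separate generators, so neither series influences the other
# and no shared input drives both (hypothetical stand-ins, 200 "years").
sea_level = monotone_series(200, max_step=1.0, rng=random.Random(1))
bread_price = monotone_series(200, max_step=5.0, rng=random.Random(2))

r = pearson(sea_level, bread_price)
print(f"correlation between the two independent series: {r:.3f}")
```

The high coefficient comes entirely from the shared upward trend; nothing in the code links the two processes.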
Reichenbach’s principle is too strong. The probabilistic correlation
between X and Y may be due to the fact that X and Y are causally
connected. However, to evaluate this possibility, we must consider
alternatives. If the alternatives we examine have lower likelihoods, relative
to data on observed frequencies, this provides evidence in favor of the
hypothesis of causal connection. On the other hand, if we consider an
alternative hypothesis that has the same likelihood as the hypothesis of
causal connection, then the data do not favor one hypothesis over the
other, or so the law of likelihood asserts. There is no iron law of metaphysics that says that a correlation between two variables must be due to
their being causally connected. Whether this is true in a given case should
be evaluated by considering the data and a set of alternative hypotheses,
not by appealing to a principle.
Those who accept Reichenbach’s principle invariably think that it is
useful as well as true. They do not affirm that correlation entails causal
connection only to deny that we can ever know that a correlation exists.31
Reichenbach’s treatment of the example of the two actors is entirely
typical. The data tell you that there is a correlation, and the correlation
tells you that there is a causal connection. This readiness to use Reichenbach’s principle to draw causal inferences from observed associations
suggests the following argument against the principle. Take a data set that
you think amply supports the claim that variables X and Y are probabilistically correlated. If you believe Reichenbach’s principle, you are
prepared to further conclude that X and Y must be causally connected.
But do you really believe that the data in front of you could not possibly
have been produced without X and Y being causally connected? For
example, take Allman et al.’s data set (§3.7). Surely it is not impossible that
each primate species (both those in the data set and those not included)
came to its values for X and Y by its own special suite of causal processes.
I do not say that this is true or even plausible, only that it is possible. This
is enough to show that Reichenbach’s principle is too strong.
Although the example I have considered to make my argument against
Reichenbach’s principle involves a data set in which two variables
monotonically increase with time, the same point holds for a data set in
which the variables each rise and fall irregularly but in seeming synchrony.
If a common cause model is plausible in this case, this is not because a
Reichenbachian principle says that it must be true. Rather, its credentials
need to be established within a contrastive inferential framework, whether
the governing principle is the law of likelihood or a model selection
criterion like AIC. For a monotonic data set, a fairly simple common-cause model and a somewhat more complex separate-cause model each fit
the data well, in which case the former will have a slightly better AIC score
than the latter. When the data set is a lot more complex, a common-cause
model that achieves good fit will have far fewer adjustable parameters than
a separate-cause model that does the same, in which case the difference in
their AIC scores will be more substantial. It does not much strain our
31 It is tempting to argue that Venetian sea levels and British bread prices really aren’t correlated
because if enough data were drawn from times outside the 200-year period, the association would
disappear. Well, maybe it would, but so what? Why must real correlations be temporally (and
spatially) unbounded?
credulity to imagine that the steady rise in British bread prices and
Venetian sea levels is due to separate causes acting on each; the strain may
be more daunting for two time series that have lots of synchronous and
irregular wiggles. But this difference is a matter of degree and the relevant
inferential principles are the same. Strong metaphysics needs to be
replaced by more modest epistemology.32
3.9 TESTING SELECTION AGAINST DRIFT WITH MOLECULAR DATA
The shift from the task of explaining a single trait value in a single species
(§3.1–§3.5) to that of explaining a correlation that exists across species
(§3.6) renders the problem of testing selection against drift more tractable.
In the former case, you need to know the location of the optimal trait
value towards which selection, if it occurs, will push the lineage. In the
latter, all that is needed is information about the slope of the optimality
line. Instead of needing to know what the optimal fur length is for the
polar bear lineage, it suffices to know that, if selection acts on fur length,
bears in cold climates have a longer optimal fur length than bears in warm.
A great deal of work in population genetics attempts to get by with
even less. Geneticists often test selection against drift by comparing DNA
sequences drawn from different species; a number of statistical tests have
been constructed for doing this (see Page and Holmes 1998, Kreitman
2000, and Nielsen 2005 for reviews). Scientists carry out these tests with
little or no information about the roles that different parts of these
sequences play in the construction of an organism’s phenotype. If these
tests are sound, they require no assumptions concerning what the optimal
sequence configuration would be; in fact, they don’t even require assumptions concerning how the optimum in one species is related to the optimum
in another. If selection can be tested without this type of information, what
does the hypothesis of natural selection predict about what we observe?
The parallel question about what drift predicts is easier to answer. The
predictions are probabilistic, not deductive. They take the following
form: If the process is one of pure drift, then the probability of this or that
observable result is such-and-such. If the observations turn out to deviate
32 Hoover (2003) proposes a patch for Reichenbach’s principle that introduces considerations
concerning stationarity and cointegration, but his proposal is still too strong; it isn’t true that a
data set that satisfies his requirements must be due to the two variables’ being causally connected.
In addition, Hoover’s reformulation makes no recommendations concerning some data sets that in
fact do favor a common cause over a separate cause model.
from what the drift hypothesis leads you to expect, should you reject it?
If you should, and if selection and drift are the only two alternatives,
selection has been “tested” by standing idly on the sidelines and witnessing the refutation of its one and only rival. If probabilistic modus
tollens (§1.4) made sense, this would be fine. But it does not. For selection
and drift to be tested against each other, both must make predictions. This
is harder to achieve for selection than it is for drift, since drift is a null
hypothesis (predicting that there should be zero difference between
various quantities; see below) whereas selection is a composite hypothesis
(predicting a difference but leaving open what its magnitude should be).
A central prediction of the neutral theory of molecular evolution is that
there should be a molecular clock (Kimura 1983). In a diploid population
containing N individuals, there are 2N nucleotides at a given site. If each
of those nucleotides has a probability μ of mutating in a given amount of
time (e.g., a year), and a mutated nucleotide has a probability u of evolving
from mutation frequency to fixation (i.e., 100 percent representation in
the population), then the expected rate of substitution (i.e., the origination
and fixation of new mutations) at that site in that population will be
k = 2Nμu.
This covers all the mutations that might occur, regardless of whether they
are advantageous, neutral, or deleterious. Because μ and u are probabilities, k isn’t the de facto rate of substitution; rather, it is a probabilistic
quantity, an expected value (§1.4). If the 2N nucleotides found at a site
at a given time are equal in fitness, the initial probability that each has of
eventually reaching fixation is
u = 1/(2N).
I say that this is the “initial” probability since the probability of fixation
itself evolves.33 If we substitute 1/2N for u in the first equation, we obtain
one of the most fundamental propositions of the neutral theory:
(Neutrality) k = μ.
33 This equality pertains to the 2N token nucleotides present at the start of the process; some of those
tokens may be of the same type. It follows from the above equality that the initial probability that a
type of nucleotide found at time t will eventually reach fixation is its frequency at time t. This
point applies to phenotypic drift models as well as genetic ones; see the squashing of the bell curve
depicted in Figure 3.4.
The expected rate of substitution at a site is given by the mutation rate if
the site is evolving by drift. Notice that the population size N has cancelled out.
What will happen if a mutation is advantageous? Its probability of
fixation (u) depends on the selection coefficient (s), on the effective
population size Ne, and on the population’s census size N:
u = 2sNe/N.34
Substituting this value for u into the first equation displayed above, we
obtain the expected rate of evolution at a site that experiences positive
selection:
(Selection) k = 4Ne sμ.
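The two substitutions into the rate equation can be checked with exact arithmetic. The sketch below is a minimal illustration, not part of the original argument: the function names are invented here, mu stands for the per-site mutation probability, and the numerical values for mu, s, Ne, and N are hypothetical. It confirms that the census size N cancels in the neutral case and that the selection case reduces to 4 × Ne × s × mu.

```python
from fractions import Fraction

def substitution_rate(N, mu, u):
    """Expected substitution rate: 2N copies, each mutating with
    probability mu, each mutation fixing with probability u."""
    return 2 * N * mu * u

def neutral_fixation_prob(N):
    """Initial fixation probability of one of 2N equally fit copies."""
    return Fraction(1, 2 * N)

def advantageous_fixation_prob(s, Ne, N):
    """Approximate fixation probability of an advantageous mutation
    (the approximation quoted in the text; small s, large N)."""
    return 2 * s * Ne / N

mu = Fraction(1, 10**8)  # hypothetical mutation probability per site per year

# Neutral case: the rate equals mu whatever the census size.
for N in (100, 10_000, 1_000_000):
    k_neutral = substitution_rate(N, mu, neutral_fixation_prob(N))
    print(N, k_neutral == mu)

# Selection case with hypothetical values for s, Ne, and N.
s, Ne, N = Fraction(1, 100), 5_000, 10_000
k_selection = substitution_rate(N, mu, advantageous_fixation_prob(s, Ne, N))
print(k_selection == 4 * Ne * s * mu)
```

Using `Fraction` rather than floating-point numbers keeps the comparisons exact, so the equalities hold literally and not merely to within rounding error.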
If the probability of mutation per unit time at each site remains constant
through time (though different sites may have different mutation probabilities), the neutral theory predicts that the expected overall rate of
evolution in the lineage does not change. This is the clock hypothesis. It
doesn’t mean that the actual rate never changes; there can be fluctuations
around the mean (expected) value. The selection hypothesis is more
complicated. If each site’s value for Ne sμ holds constant through time,
(Selection) also entails the clock hypothesis. But there is every reason to
expect this quantity to fluctuate. After all, Ne is a quantity that reflects the
breeding structure as well as the census size of the population (Crow and
Kimura 1970) whereas s, the selection coefficient, reflects the ecological
relationship that obtains between a nucleotide in an organism and the
environment. With both these quantities subject to fluctuation, it would
be a miracle if their product remained unchanged. This is why (Selection)
is taken to predict that there is no molecular clock.35
If we could trace a single lineage through time, taking molecular
snapshots on several occasions, it would be easy to test (Neutrality)
against (Selection). Although this procedure can be carried out on
populations of rapidly reproducing organisms, it isn’t feasible with respect
to lineages at longer time scales. It is here that the fact of common
ancestry comes to the rescue, just as it did in §3.6. We do not need a time
34 This useful approximation is strictly correct only for small s and large N.
35 The simple selection model described here does not predict a clock, but more complicated
selection models sometimes do. Discussion of these would take us too far afield.
[Figure 3.23: phylogenetic tree with three tips: 1 = Old World monkey, 2 = Human, 3 = New World monkey]
Figure 3.23 Given the phylogeny, the neutral theory entails that the expected difference
between 1 and 3 equals the expected difference between 2 and 3 (figure from Page and
Holmes 1998: 255).
machine that allows us to travel into the past so that we can observe earlier
states of a lineage we now see in the present; rather, we can look at three or
more tips in a phylogenetic tree and perform a relative rates test. Figure 3.23
provides an independently justified phylogeny of human beings, Old
World monkeys, and New World monkeys. We can gather sequence
data from these three taxa and observe how many differences there are
between each pair. The neutral hypothesis predicts that (d13 − d23) = 0
(or, more precisely, it entails that the expected value of this difference is
zero); here dij is the number of differences between i and j. We don’t
need to know how many changes occurred in each lineage; it suffices to
know how many changes separate one extant group from another.
Looking at the present tells you what must have occurred in the past,
given the fact of common ancestry.
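The quantity used in the relative rates test is easy to compute once aligned sequences are in hand. The following sketch uses three short made-up sequences (they are not real data, and the helper name is an assumption) to show how the pairwise difference counts and the difference d13 − d23 would be obtained.

```python
def count_differences(seq_a, seq_b):
    """Number of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

# Toy aligned sequences, invented purely for illustration.
old_world_monkey = "ACGTTGCAAGCT"  # tip 1
human            = "ACGTAGCAAGCT"  # tip 2
new_world_monkey = "ACCTAGGAAGTT"  # tip 3

d13 = count_differences(old_world_monkey, new_world_monkey)
d23 = count_differences(human, new_world_monkey)

print("d13 =", d13)
print("d23 =", d23)
print("d13 - d23 =", d13 - d23)  # neutrality entails an expected value of zero
```

Only present-day sequences enter the calculation; no count of the changes within each lineage is needed.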
Li et al. (1987) carried out this test and discovered that d13 is significantly greater than d23; they didn’t look at whole genomes but at a
sample of synonymous sites, introns, flanking regions, and a pseudo-gene,
totaling about 9,700 base pairs. With respect to these parts of the genome,
human beings diverged more slowly than Old World monkeys from their
most recent common ancestor (see also Li 1993). This result was taken to
favor (Selection) over (Neutrality). In view of the negative comments I
made about Neyman–Pearson hypothesis testing in Chapter 1, I want to
examine the logic behind this analysis more carefully. Using relative rates
to test drift against selection resembles using two sample means to test
whether two fields of corn have the same mean height. The null
hypothesis says that they are the same; the alternative to the null says that
they differ, but it does not say by how much. The Neyman–Pearson
theory conceives of the testing problem in terms of acceptance and
rejection and requires that one stipulate an arbitrary level for α, the
probability of a Type-1 error. I suggested in Chapter 1 that it makes more
sense to place this problem in a model-selection framework. In the
relative-rates test, the drift hypothesis has no adjustable parameters
whereas the selection hypothesis has one. The question is not which
hypothesis to reject but which can be expected to be more predictively
accurate. The AIC score of the selection hypothesis is found by determining the maximum likelihood value of a single parameter θ, the
expected value of (d13 − d23), taking the log-likelihood when θ is assigned
its maximum likelihood value and subtracting the penalty for complexity.
The question is whether the selection hypothesis’ better fit to data suffices
to compensate for its greater complexity.
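The comparison just described can be sketched numerically, though only under a strong simplifying assumption that is not in the text: that the observed value of d13 − d23 is normally distributed around its expected value (called theta in the code) with a known standard deviation. The function names and the numbers are hypothetical. The score follows the convention used here, log-likelihood minus the number of adjustable parameters, so higher is better; the more familiar form 2k − 2 ln L ranks models the same way with lower being better.

```python
import math

def normal_log_likelihood(x, mean, sd):
    """Log of the normal density at x (an assumed error model)."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)

def aic_scores(observed_diff, sd):
    """Return (drift_score, selection_score); higher is better."""
    # Drift: the expected difference is fixed at zero, no adjustable parameters.
    drift = normal_log_likelihood(observed_diff, 0.0, sd) - 0
    # Selection: one adjustable parameter, whose maximum likelihood
    # value under this error model is the observed difference itself.
    theta_hat = observed_diff
    selection = normal_log_likelihood(observed_diff, theta_hat, sd) - 1
    return drift, selection

for diff in (1.0, 10.0):  # hypothetical observed values of d13 - d23
    drift, selection = aic_scores(diff, sd=2.0)
    better = "selection" if selection > drift else "drift"
    print(f"observed difference {diff}: drift={drift:.2f}, "
          f"selection={selection:.2f}, better score: {better}")
```

Under this toy model the selection hypothesis earns the better score only when its gain in fit exceeds the one-unit penalty, which happens when the observed difference is large relative to the assumed standard deviation.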
Bayesians and likelihoodists come at the problem differently. Their
framework obliges them to compute the average likelihood of the selection
hypothesis, which, as noted, is composite. If selection acted on these
different parts of the genome, how much should we expect d13 and d23
to differ? We would need to answer this question without looking at
the data. And since different types of selection predict different values for
(d13 − d23), our answer would have to average over these different possibilities. It isn’t impossible that empirical information should one day
provide a real answer to this question. However, at present, there is no
objective basis for producing an answer. I suggest that the model-selection
approach is more defensible than both Bayesianism and Neyman–Pearson
hypothesis testing as a tool for structuring the relative rate test.
The role played by the fact of common ancestry in facilitating tests of
process hypotheses can be seen in another context. If we could look at a
large number of replicate populations that all begin in the same state, we
could see if the variation among the end states of those lineages is closer to
the predictions of neutrality or selection. But why think that each of these
lineages begins in the same state? The answer is simple: If they share a
common ancestor, they must have. The neutral theory predicts that the tips
of a tree should vary according to a Poisson distribution. A number of
mammalian proteins (e.g., Hemoglobin α and β, Cytochrome c, Myoglobin)
were found to be “over dispersed” (Kimura 1983; Gillespie 1986), and
this was taken to be evidence of selection. Once again, neutrality is a null
hypothesis, and selection is composite.
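Because a Poisson distribution has a variance equal to its mean, one common way to summarize over-dispersion is the ratio of the sample variance to the sample mean. The sketch below is only an illustration of that ratio: the substitution counts are invented, the lineages are treated as independent replicates, and nothing here is claimed to reproduce the statistics used in the cited studies.

```python
import statistics

def index_of_dispersion(counts):
    """Sample variance divided by sample mean.
    About 1 is what a Poisson process leads one to expect;
    values well above 1 indicate over-dispersion."""
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical numbers of substitutions along four lineages.
even_counts = [10, 10, 10, 10]
spread_counts = [2, 18, 5, 15]

print("even counts:  ", index_of_dispersion(even_counts))
print("spread counts:", round(index_of_dispersion(spread_counts), 2))
```

With only a handful of lineages the ratio is itself quite variable, so a value above 1 in a small sample is weak evidence taken alone.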
Like the relative rate test, the McDonald–Kreitman test also relies on
an independently justified phylogeny, and there is no optimality line in
sight. McDonald and Kreitman (1991) examined sequences from three
species of Drosophila that all play a role in constructing the protein alcohol dehydrogenase (Adh). Fruit flies often eat fruit that is fermented,
and they need to break down the alcohol (human beings have the same
problem and solve it by way of a different version of the same protein). So
it seems obvious that the protein is adaptive. What is less obvious is
whether variations in the gene sequences that code for the protein are
adaptive or neutral. Perhaps different species find it useful to have different versions of the protein, and selection has caused these species to
diverge from each other. And different local populations that belong to
the same species may also have encountered different environments that
select for different sequences. Alternatively, the variation may be neutral.
The McDonald–Kreitman test compares the synonymous and nonsynonymous differences that are found both in different populations of
the same species and in different species. A substitution in a codon is said
to be synonymous when it does not affect the amino acid that results. For
example, the codons UUU and UUC both produce the amino acid
phenylalanine, while CUU, CUC, CUA and CUG all produce leucine. It
might seem obvious that synonymous substitutions must be caused by
neutral evolution, since they do not affect which amino acids and proteins
are constructed downstream. This would suggest that the hypothesis that
nonsynonymous substitutions evolve neutrally can be tested by seeing if
the rates of synonymous and nonsynonymous substitutions are the same.
However, it is possible that synonymous substitutions might not be
neutral, owing, for example, to differences in secondary structure that
bear on the stability of the molecules (Page and
Holmes 1998: 243). What seems safer is the inference that the ratio of the
rates of synonymous to nonsynonymous substitutions should be a constant if there is neutral evolution. Both should depend just on the
mutation rate, as discussed above. This ratio should have the same value
regardless of whether the sequences compared come from two populations of the same species or from different species.
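The distinction between synonymous and nonsynonymous substitutions can be made concrete with a lookup table. The sketch below includes only a small excerpt of the standard genetic code (the codons mentioned above plus a few neighbors), and the function name is an assumption introduced for illustration.

```python
# Partial excerpt of the standard genetic code (RNA codons).
CODON_TO_AMINO_ACID = {
    "UUU": "Phe", "UUC": "Phe",
    "UUA": "Leu", "UUG": "Leu",
    "CUU": "Leu", "CUC": "Leu", "CUA": "Leu", "CUG": "Leu",
    "AUG": "Met",
}

def is_synonymous(codon_before, codon_after):
    """True when both codons specify the same amino acid."""
    return CODON_TO_AMINO_ACID[codon_before] == CODON_TO_AMINO_ACID[codon_after]

print(is_synonymous("UUU", "UUC"))  # both phenylalanine
print(is_synonymous("CUU", "CUG"))  # both leucine
print(is_synonymous("UUU", "CUU"))  # phenylalanine versus leucine
```

A full analysis would need the complete 64-codon table; codons missing from this excerpt raise a `KeyError`.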
Figure 3.24 describes the four kinds of observations that McDonald
and Kreitman assembled. They counted the number of nonsynonymous
and synonymous differences that separate different populations of the
same species; these are called polymorphisms. They also counted the
number of synonymous and nonsynonymous fixed differences that separate
pairs of species; these are sites that are monomorphic within each species
but vary between them.
[Figure 3.24: tree of D. melanogaster, D. simulans, and D. yakuba, labeled with nonsynonymous/synonymous counts. Polymorphisms: 2/14, 0/11, 0/17. Fixed differences: 1/2, 1/0, 5/15.]
Figure 3.24 The number of nonsynonymous and synonymous differences that exist
within and between three Drosophila species at the Adh locus. The within-species
differences are called polymorphisms; the between-species differences are called fixed.
Data from McDonald and Kreitman (1991); figure from Page and Holmes (1998: 267).
Here are the numbers of differences that
McDonald and Kreitman found in the four categories:
                 Fixed   Polymorphic
Synonymous        17        42
Nonsynonymous      7         2
If these sequences evolved neutrally, the ratio of synonymous to nonsynonymous substitutions for the first column should be about the same
as the ratio for the second. But they aren’t close:
17/7 ≪ 42/2.
There is an excess of nonsynonymous fixed differences (or a deficiency of
polymorphic nonsynonymous substitutions). McDonald and Kreitman
took this departure from the prediction of neutrality to be evidence for
selection: Selection had reduced the within-species variation and amplified the between-species variation at nonsynonymous sites. They note that
a population bottleneck could also explain the data but argue that the
known history of Drosophila makes this alternative implausible.
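The size of the departure from the neutral prediction can be read off directly from the four counts reported by McDonald and Kreitman. The sketch below merely does the arithmetic; the dictionary layout and function name are illustrative choices, and no significance test is attempted.

```python
from fractions import Fraction

# Counts from McDonald and Kreitman (1991), as tabulated in the text.
counts = {
    ("synonymous", "fixed"): 17,
    ("synonymous", "polymorphic"): 42,
    ("nonsynonymous", "fixed"): 7,
    ("nonsynonymous", "polymorphic"): 2,
}

def syn_to_nonsyn_ratio(counts, column):
    """Ratio of synonymous to nonsynonymous differences in one column."""
    return Fraction(counts[("synonymous", column)],
                    counts[("nonsynonymous", column)])

fixed_ratio = syn_to_nonsyn_ratio(counts, "fixed")
polymorphic_ratio = syn_to_nonsyn_ratio(counts, "polymorphic")

print("fixed:      ", fixed_ratio, "=", round(float(fixed_ratio), 2))
print("polymorphic:", polymorphic_ratio, "=", float(polymorphic_ratio))
print("polymorphic / fixed:", round(float(polymorphic_ratio / fixed_ratio), 2))
```

Neutrality leads one to expect the two ratios to be about equal, so that their quotient is near 1; here the polymorphic ratio is several times the fixed ratio.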
The same epistemological questions arise in connection with the
McDonald–Kreitman test that I raised about the relative rates test. The
inference should not be thought of as an instance of probabilistic modus