2017, Oct 2

Singular Causation and Model Reduction

In my previous post I tried to get clear about when variables could be safely removed from a causal model without affecting what the model is capable of telling us about singular causal relations. There, I endorsed two principles stating when causal models may be reduced by excising variables in a particular way. If we endorse these principles, and we want to give a theory of singular causation formulated in terms of correct causal models, then we should want that theory to give the very same verdicts before and after model reduction. The point of today’s post is that there is a wide family of theories of causation which run afoul of this constraint. Those theories will say that two variable values are causally related in one model, but reverse this judgment when the model is reduced.

1. Counterfactual Counterfactual Theories of Singular Causation

1.1 Counterfactuals in Causal Models

Causal models allow us to evaluate certain causal counterfactual conditionals. For instance, recall the causal model describing the relations of causal determination between whether the switch is up, whether the power is on, and whether the light is illuminated.
$$ \begin{aligned} L &:= S \wedge P \\\
P &:= S \end{aligned} $$ Suppose that, actually, the switch is up, $S=1$, so that the power is on and the light is illuminated. If we want to evaluate the counterfactual conditional $P = 0 \hspace{4pt}\Box\hspace{-4pt}\to L = 1$ (were the power off, the light would be illuminated), we mutilate the model $\mathbb{M}$ by removing $P$'s equation, severing $P$'s dependence upon $S$, and setting its value to $0$ directly. That is, we exogenize the variable $P$, and add the assignment $P=0$ to the context $\vec{u}$. Graphically, we cut the arrow going into $P$, but leave all other arrows intact.

Call the resulting mutilated model “$\mathbb{M}[P \to 0]$”. The semantics for counterfactuals tells us that $P = 0 \hspace{4pt}\Box\hspace{-4pt}\to L = 1$ is true in the model $\mathbb{M}$ iff $L=1$ is true in the mutilated model $\mathbb{M}[P \to 0]$. $$ \mathbb{M} \models P = 0 \hspace{4pt}\Box\hspace{-4pt}\to L = 1 \quad \iff \quad \mathbb{M}[P \to 0] \models L =1 $$ Since $L=0$ in the mutilated model $\mathbb{M}[P \to 0]$, this tells us that the counterfactual $P = 0 \hspace{4pt}\Box\hspace{-4pt}\to L = 1$ is false in the original model $\mathbb{M}$.

More generally, if $\vec{X}$ is a vector of variables in $\mathbb{U} \cup \mathbb{V}$ and $\vec{x}$ is some assignment of values to those variables, then we may define $\mathbb{M}[\vec{X} \to \vec{x}]$ to be the mutilated model that you get by going through each variable $ X \in \vec{X}$ and, if $X$ is endogenous, removing $X$'s structural equation $\phi_X$ from $\mathbb{E}$, moving $X$ from $\mathbb{V}$ to $\mathbb{U}$, and adding the assignment $\vec{x}(X)$ to the context $\vec{u}$. (By the way, “$\vec{x}(X)$” is the value which $\vec{x}$ assigns to the variable $X$.) On the other hand, if $X \in \vec{X}$ is exogenous, then you simply change the context so that $\vec{u}(X) = \vec{x}(X)$. Then, for any $\phi$, we have that $$ \mathbb{M} \models \vec{X} = \vec{x} \hspace{4pt}\Box\hspace{-4pt}\to \phi \quad\iff \quad \mathbb{M}[\vec{X} \to \vec{x}] \models \phi $$
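To make this definition concrete, here is a minimal Python sketch of a causal model and its mutilations (the dictionary representation and the helper names `solve`, `intervene`, and `counterfactual` are my own illustrative choices; the sketch assumes a finite, acyclic system of equations):

```python
# A minimal sketch, assuming an acyclic system of structural equations.
# Each equation is a function from a dict of variable values to a value;
# the context assigns values to the exogenous variables.

def solve(equations, context):
    """Extend the context to the endogenous variables by evaluating equations."""
    values = dict(context)
    while len(values) < len(context) + len(equations):
        for var, phi in equations.items():
            if var not in values:
                try:
                    values[var] = phi(values)
                except KeyError:
                    pass  # some parent is not settled yet; try again next pass
    return values

def intervene(equations, context, setting):
    """Build M[X -> x]: drop the equation of each intervened variable and
    write its new value directly into the context."""
    new_eqs = {v: phi for v, phi in equations.items() if v not in setting}
    return new_eqs, {**context, **setting}

def counterfactual(equations, context, antecedent, var):
    """The value of `var` in M[X -> x]; M satisfies (X = x []-> var = v)
    iff this returns v."""
    eqs, ctx = intervene(equations, context, antecedent)
    return solve(eqs, ctx)[var]

# The switch/power/light model: L := S AND P, P := S, in the context S = 1.
equations = {"L": lambda v: v["S"] & v["P"], "P": lambda v: v["S"]}
context = {"S": 1}

# L = 0 in M[P -> 0], so (P = 0 []-> L = 1) is false in M.
print(counterfactual(equations, context, {"P": 0}, "L"))  # 0
```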

1.2 Counterfactual Counterfactual Dependence

Many contemporary theories of causation fit into the following general schema, which we can call “Counterfactual Counterfactual”:


Counterfactual Counterfactual. $C=c$ caused $E=e$ in causal model $\mathbb{M}$ iff there is some value of $C$, $c'$, such that $$ \mathbb{M}[\vec{G}\to\vec{g}] \models C = c' \hspace{4pt}\Box\hspace{-4pt}\to E \neq e $$ for some suitable vector of variables $\vec{G}$ and a suitable assignment of values $\vec{g}$.


According to Counterfactual Counterfactual, causation is not counterfactual dependence; rather, it is counterfactual dependence in some counterfactual scenario, $\vec{G} = \vec{g}$. Assuming that the empty vector of variables counts as suitable, Counterfactual Counterfactual will entail that counterfactual dependence is sufficient for causation.

We will get different theories of causation depending upon which vectors of variables, and which assignments of values, we take to be suitable. For instance, the account of Hitchcock (2001) tells us that $\vec{G}$ and $\vec{g}$ are suitable iff, in $\mathbb{M}$, there is some directed path leading from the variable $C$ to the variable $E$, $C \to V_1 \to V_2 \to \dots \to V_N \to E$, such that, in the counterfactual model $\mathbb{M}[\vec{G} \to \vec{g}]$, every variable $V$ along this path retains its actual value in the original model, $\vec{u}(V)$.


Hitchcock (2001). $\vec{G}$ and $\vec{g}$ are suitable iff there is some path from $C$ to $E$ such that, for every variable $V$ along this path, $$ \mathbb{M}[\vec{G} \to \vec{g}] \models V = \vec{u}(V) $$
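In code, the suitability test is a short check on top of the sketch from section 1.1 (I repeat the `solve` and `intervene` helpers so that this block runs on its own; the path must be supplied by hand):

```python
def solve(eqs, ctx):
    # Helpers repeated from the earlier sketch, so this block stands alone.
    vals = dict(ctx)
    while len(vals) < len(ctx) + len(eqs):
        for v, phi in eqs.items():
            if v not in vals:
                try:
                    vals[v] = phi(vals)
                except KeyError:
                    pass
    return vals

def intervene(eqs, ctx, setting):
    return {v: p for v, p in eqs.items() if v not in setting}, {**ctx, **setting}

def hitchcock_suitable(eqs, ctx, G_setting, path):
    """Hitchcock (2001): in M[G -> g], every variable on the chosen path
    from C to E must keep its actual value."""
    actual = solve(eqs, ctx)
    cf = solve(*intervene(eqs, ctx, G_setting))
    return all(cf[v] == actual[v] for v in path)

# Switch/power/light again: the empty vector is (trivially) suitable along
# the path S -> P -> L, and L = 0 when S is set to 0, so S = 1 caused L = 1.
eqs = {"P": lambda v: v["S"], "L": lambda v: v["P"] & v["S"]}
print(hitchcock_suitable(eqs, {"S": 1}, {}, ["S", "P", "L"]))  # True
print(solve(*intervene(eqs, {"S": 1}, {"S": 0}))["L"])         # 0
```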


There are some cases in which Hitchcock looks too strong (e.g., the Voting Machine case from appendix A.2 of Halpern & Pearl (2005)). These and other cases were taken to motivate a move to the following weakening.


Halpern and Pearl (2005). $\vec{G}$ and $\vec{g}$ are suitable iff, for all vectors of variables $\vec{P}$ not in $\vec{G}$, and any subvector $\vec{H}$ of $\vec{G}$, $$ \mathbb{M}[\vec{H} \to \vec{g}(\vec{H}), \vec{P} \to \vec{u}(\vec{P}), C \to c] \models E=e $$
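Here is the same kind of sketch for this test (again with the helpers repeated so the block stands alone; letting the vectors $\vec{P}$ range over everything other than $C$, $E$, and the variables in $\vec{G}$ is my reading of the definition):

```python
from itertools import chain, combinations

def solve(eqs, ctx):
    vals = dict(ctx)
    while len(vals) < len(ctx) + len(eqs):
        for v, phi in eqs.items():
            if v not in vals:
                try:
                    vals[v] = phi(vals)
                except KeyError:
                    pass
    return vals

def intervene(eqs, ctx, setting):
    return {v: p for v, p in eqs.items() if v not in setting}, {**ctx, **setting}

def subsets(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def hp_suitable(eqs, ctx, C, c, E, e, G_setting):
    """Halpern & Pearl (2005): for every subvector H of G held at g, and
    every vector P of other variables held at their actual values, setting
    C back to c still yields E = e."""
    actual = solve(eqs, ctx)
    others = [v for v in actual if v not in G_setting and v not in (C, E)]
    for H in subsets(list(G_setting)):
        for P in subsets(others):
            setting = {v: G_setting[v] for v in H}
            setting.update({v: actual[v] for v in P})
            setting[C] = c
            if solve(*intervene(eqs, ctx, setting))[E] != e:
                return False
    return True

# In the switch/power/light model, the empty G is suitable for the candidate
# cause S = 1 and effect L = 1.
eqs = {"P": lambda v: v["S"], "L": lambda v: v["P"] & v["S"]}
print(hp_suitable(eqs, {"S": 1}, "S", 1, "L", 1, {}))  # True
```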


Notice that, if a counterfactual setting $\vec{G} = \vec{g}$ is suitable according to Hitchcock (2001), then it will automatically be suitable according to Halpern and Pearl (2005). So, if $C=c$ caused $E=e$ according to Hitchcock (2001), then $C=c$ caused $E=e$ according to Halpern and Pearl (2005).

Both of these accounts of causation face a problem with cases of what’s come to be known as bogus prevention, illustrated by the neuron diagram below.

In this neuron diagram, $C$'s firing does not prevent $E$ from firing (that is: $C$'s firing did not cause $E$ to not fire). However, both Hitchcock (2001) and Halpern and Pearl (2005) get the verdict that $C$'s firing prevented $E$ from firing. That’s because both of them rule the singleton vector of variables $\vec{G} = (A)$, with the assignment $\vec{g}=(1)$, suitable. And, in this counterfactual setting, whether $E=0$ counterfactually depends upon whether $C = 1$.

In response to cases like these, there has been further emendation of the Halpern and Pearl account to incorporate standards of normality, or typicality. Halpern (2008) emends the Halpern and Pearl (2005) account like so:


Halpern (2008). $\vec{G}$ and $\vec{g}$ are suitable iff, for all vectors of variables $\vec{P}$ not in $\vec{G}$, and any subvector $\vec{H}$ of $\vec{G}$, $$ \mathbb{M}[\vec{H} \to \vec{g}(\vec{H}), \vec{P} \to \vec{u}(\vec{P}), C \to c] \models E=e $$ and, in addition, there is some assignment of values to the variables in the model such that, in that assignment, $\vec{G} = \vec{g}$ and $C = c'$, and that assignment is at least as normal, or typical, as the variable assignment of the original model $\mathbb{M}$.


This definition requires us to outfit our causal models with a ranking over assignments of values to all of the variables in $\mathbb{U} \cup \mathbb{V}$. There will be complicated questions about which variable values are more normal than which others; however, if we restrict our attention to simple neuron diagrams, we can at least rest assured that everybody seems to agree that it is more normal or typical for a neuron to not fire than it is for it to fire. If we assume that $A$'s not firing is more normal than $A$'s firing, then Halpern (2008) tells us that the counterfactual setting $A=1$ in Bogus Prevention is not suitable; and, therefore, that $C=1$ did not cause $E=0$.
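To see the shape of the case in miniature, here is a sketch using a standard rendering of Bogus Prevention (the equation $E := A \wedge \neg C$, with context $A=0$ and $C=1$, is my reconstruction of the diagram, which is not reproduced here):

```python
# Bogus Prevention, rendered as E := A and not C, with context A = 0, C = 1.
E = lambda A, C: int(A & (1 - C))

print(E(0, 1))  # 0: actually, E does not fire
# In the counterfactual setting A = 1, whether E fires depends on C:
print(E(1, 1))  # 0
print(E(1, 0))  # 1
# Hitchcock (2001) and Halpern & Pearl (2005) accept the witness A = 1; Halpern
# (2008) rejects it, since every assignment with A = 1 has A firing, and so is
# less normal than the actual assignment, in which A does not fire.
```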

Notice that, if a counterfactual setting $\vec{G} = \vec{g}$ is suitable according to Halpern (2008), then it will automatically be suitable according to Halpern and Pearl (2005). So, if $C=c$ caused $E=e$ according to Halpern (2008), then $C=c$ caused $E=e$ according to Halpern and Pearl (2005).

For further discussion of these accounts, see chapters 7 and 8 of my Seminar Notes for Causality.

2. Counterfactual Counterfactual Accounts Reverse Causal Judgments in Model Reductions

Recall the Lewisian neuron diagram of a case of preemption.

We may model this neuron diagram with the following system of structural equations (where the variables have the natural interpretation, with $1$ corresponding to firing and $0$ corresponding to not firing):

$$ \begin{aligned} E &:= B \vee D \\\
D &:= C \\\
B &:= A \wedge \neg C \end{aligned} $$

(The context is just $C=1$ and $A=1$.) Let’s call this model “$\mathbb{M}$”. In the original neuron diagram, $C$'s firing is a cause of $E$'s firing. So we should want our theory of singular causation to tell us that, in this causal model, $C=1$ is a cause of $E=1$. Getting cases like this right is non-negotiable for a theory of singular causation. And, fortunately, counterfactual counterfactual accounts like Hitchcock’s (2001) and Halpern and Pearl’s (2005) are capable of saying that $C=1$ is a cause of $E=1$ in the causal model above. To deliver this verdict, those theories let $\vec{G} = (B)$, with $\vec{g} = (0)$. As you may verify for yourself, Hitchcock (2001), Halpern and Pearl (2005), and Halpern (2008) all deem this choice suitable. But then, $$ \mathbb{M}[B \to 0] \models C = 0 \hspace{4pt}\Box\hspace{-4pt}\to E = 0 $$ That is: in the counterfactual scenario where $B$'s value is held fixed at $0$, had $C$ not fired, $E$ would not have fired either. So, Counterfactual Counterfactual deems $C=1$ a cause of $E=1$.
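A quick sketch confirms the verdict (the function below just hard-codes the three structural equations; `B_fixed` plays the role of the intervention $B \to 0$):

```python
# Preemption: E := B or D, D := C, B := A and not C, with context A = C = 1.
def E_value(A, C, B_fixed=None):
    B = (A & (1 - C)) if B_fixed is None else B_fixed  # B's equation, unless B is held fixed
    D = C
    return B | D

print(E_value(1, 1))             # 1: actually, E fires
print(E_value(1, 0))             # 1: no plain counterfactual dependence; B picks up the slack
print(E_value(1, 0, B_fixed=0))  # 0: holding B fixed at 0, E's firing depends upon C's
```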

Note that the following is an exogenous reduction of this model in which we have excised the exogenous variable $A$ by substituting $1$ for $A$ in $B$'s structural equation. $$ \begin{aligned} E &:= B \vee D \\\
D &:= C \\\
B &:= \neg C \end{aligned} $$ Call the resulting model “$\mathbb{M}_A$”. The endogenous variable set of $\mathbb{M}_A$ is non-empty and the equation for $B$ is still surjective, so this is a valid exogenous reduction. By our principle Valid Exogenous Reduction Preserves Correctness (see the previous post), $\mathbb{M}_A$ is correct if the original model $\mathbb{M}$ was.

Given $\mathbb{M}_A$, we may excise the endogenous variable $B$ by removing $B$'s structural equation and substituting $\neg C$ for $B$ in $E$'s structural equation. $$ \begin{aligned} E &:= \neg C \vee D \\\
D &:= C
\end{aligned} $$ Call the resulting model “${\mathbb{M}_A}_B$”. $B$ is not a collider in $\mathbb{M}_A$, so this is a valid endogenous reduction of $\mathbb{M}_A$. By our principle Valid Endogenous Reduction Preserves Correctness (see the previous post), ${\mathbb{M}_A}_B$ is correct if $\mathbb{M}_A$ was.

However, just considering this model, Counterfactual Counterfactual tells us that $C=1$ is not a cause of $E=1$. For the only possible choices of $\vec{G}$ are the empty vector and the singleton vector $(D)$. Since $E=1$ does not counterfactually depend upon $C=1$, the empty vector does not witness $C=1$'s causing $E=1$. And $$ {\mathbb{M}_A}_B[D \to 1] \models C=0 \hspace{4pt}\Box\hspace{-4pt}\to E=1 $$ So there is no counterfactual dependence between $E=1$ and $C=1$ in the counterfactual scenario in which $D$ is held fixed at $1$. And $$ {\mathbb{M}_A}_B[D \to 0] \models C=0 \hspace{4pt}\Box\hspace{-4pt}\to E=1 $$ So there is no counterfactual dependence between $E=1$ and $C=1$ in the counterfactual scenario in which $D$ is held fixed at $0$. So there is no counterfactual dependence between $E=1$ and $C=1$ period. So they are not causally related, according to Counterfactual Counterfactual.
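The same style of sketch exhausts the possibilities in the reduced model:

```python
# The doubly reduced model: E := (not C) or D, D := C, with context C = 1.
def E_value(C, D_fixed=None):
    D = C if D_fixed is None else D_fixed
    return (1 - C) | D

print(E_value(0))             # 1: no dependence with the empty G
print(E_value(0, D_fixed=1))  # 1: none with D held at 1
print(E_value(0, D_fixed=0))  # 1: none with D held at 0
```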

What we’ve just seen is that, if we accept the principles on valid model reduction from the previous post, then the verdicts of a theory like Counterfactual Counterfactual vary from correct model to correct model. Above, we relied upon both of the principles Valid Exogenous Reduction Preserves Correctness and Valid Endogenous Reduction Preserves Correctness. However, we can get Halpern (2008) to flip its verdict by just excising an exogenous variable from a correct causal model.

Consider the following neuron diagram.

Here’s how to read this diagram: if $B$ fires at $t_1$, then it will cancel out any one signal sent from $A$ or $C$. So, if $B$ fires and exactly one of $A$ and $C$ fire, then $E$ will not fire. If $B$ fires and both $A$ and $C$ fire, then $E$ will fire. And, if $B$ doesn’t fire, then $E$ will fire iff at least one of $A$ and $C$ fire.

We can represent this neuron diagram with the following structural equation. $$ E := (\neg B \wedge (A \vee C)) \vee (B \wedge (A \wedge C)) $$ (The context is $A=C=1$ and $B=0$.) Call the causal model containing these variables, this context, and this equation “$\mathbb{M}$”. Given the natural assumption that not firing is more normal, or typical, than firing, Halpern (2008) tells us that $C$'s firing ($C=1$) is a cause of $E$'s firing ($E=1$) in $\mathbb{M}$. That’s because the variable assignment in which none of the neurons fire is more normal than the actual variable assignment, and this is an assignment in which $A=C=0$. So, Halpern (2008) tells us that the counterfactual setting $A=0$ is suitable; and, in this counterfactual setting, whether $E=1$ counterfactually depends upon whether $C=1$. $$ \mathbb{M}[A \to 0] \models C = 0 \hspace{4pt}\Box\hspace{-4pt}\to E = 0 $$
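Again, a sketch makes the dependence easy to check:

```python
# E := (not B and (A or C)) or (B and A and C), with context A = C = 1, B = 0.
E = lambda A, B, C: int(((1 - B) & (A | C)) | (B & A & C))

print(E(1, 0, 1))  # 1: the actual value of E
# In M[A -> 0], E's value tracks C's:
print(E(0, 0, 1))  # 1
print(E(0, 0, 0))  # 0: had C not fired, E would not have fired
```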

So, $C=1$ caused $E=1$. Now, I don’t think that this verdict is a desideratum of a theory of causation. I, like Lewis and Mackie, am content with an account which says that, while neither $A=1$ nor $C=1$ individually caused $E=1$, the disjunction (or the fusion, or what-have-you) of $A=1$ and $C=1$ did. However, I am also content with an account according to which $C=1$ was a cause of $E=1$. And that is how Halpern (2008), like Hitchcock (2001) and Halpern and Pearl (2005), comes down on this case.

Suppose that we excise the exogenous variable $A$ from this model. This gives us the new model $\mathbb{M}_A$, which contains the variables $C, B,$ and $E$, and the structural equation $$ E := \neg B \vee C $$ The resulting endogenous variable set is non-empty, and the resulting structural equation is surjective, so this exogenous reduction is valid. By our principle Valid Exogenous Reduction Preserves Correctness, $\mathbb{M}_A$ is correct if $\mathbb{M}$ was.

Now, while Halpern (2008) said that $C=1$ caused $E=1$ in $\mathbb{M}$, it reverses this judgment in $\mathbb{M}_A$. That’s because, in the actual context $B=0$, whether $E=1$ does not counterfactually depend upon whether $C=1$. And while, in the counterfactual setting $B=1$, whether $E=1$ does counterfactually depend upon whether $C=1$, $$ \mathbb{M}_A[B \to 1] \models C=0 \hspace{4pt}\Box\hspace{-4pt}\to E=0 $$ this counterfactual setting is not suitable, according to Halpern (2008). For having $B$ fire is less normal than having $B$ not fire. (Or, if we reject this normality ranking, for whatever reason, this calls into question whether the account is capable of getting the right verdict in Bogus Prevention.)

So, again, if we accept the principle that Valid Exogenous Reduction Preserves Correctness, then the verdicts of Halpern (2008) vary from correct model to correct model.

2017, Oct 1

When can variables be safely removed from a causal model?

Much of our causal talk consists of sentences of the form “c caused e”, where both c and e are token, non-repeatable events or facts or what-have-you (there will be disagreement about what kinds of things ‘c’ and ‘e’ denote, but for now, I’ll just call them ‘events’). Let’s call the kinds of causal relations we’re talking about with sentences like those ‘singular causal relations’. The topic of causation is not exhausted by singular causal relations. There are other interesting causal notions which are clearly distinct from (though they may bear interesting relations to) singular causation. For instance, “Smoking causes cancer” is not a singular causal claim, but rather a general causal claim, relating not token events but rather general types of events.

Many have become convinced that the best way to theorize about singular causation is by understanding it within the context of some explicitly represented system of causal determination. Causal determination is a third causal notion, distinct from both singular and general causation. Even though it is incorrect to say that the power being on caused the light to not be illuminated, it is nevertheless true that whether the light is illuminated is causally determined by whether the power is on and the switch is up. And, though one could infer that the power is on from the fact that the light is illuminated, it would be incorrect to say that whether the power is on is causally determined by whether the light is illuminated. To say this would be to get the direction of causal determination the wrong way ‘round.

Those who think that systems of causal determination have an important role to play in a theory of singular causation typically think that these systems of causal determination may be represented with systems of structural equations. A system of structural equations is a particular kind of model of a network of causal determination. The model consists of a vector of variables (see section 1 of this post for more on how to think about variables), together with a vector of structural equations. For instance, we may introduce a variable $P$ for whether the power is on (at the relevant place and the relevant time). This variable takes on the value $1$ if the power is on, takes on the value $0$ if the power is off, and is undefined otherwise. We may similarly introduce a variable $L$ for whether the light is illuminated—a variable which takes on the value $1$ if the light is illuminated, takes on the value $0$ if the light is not illuminated, and is undefined otherwise (if, e.g., the light doesn’t exist). And we may introduce a variable $S$ for whether the switch is up or not (again, $1$ if it’s up, $0$ if it’s down, undefined otherwise). A structural equation then tells us how the value of $L$ is causally determined by the values of $P$ and $S$. In particular, it tells us that \begin{equation}
L := P \wedge S \label{1}\tag{1} \end{equation} (Here, “$\wedge$” is just the truth-function ‘and’.) So, $L$ will take on the value $1$ iff both $P$ and $S$ take on the value $1$. If either $P$ or $S$ is $0$, then $L$ will take on the value $0$ as well. When we combine multiple structural equations, we can get a system of structural equations. These systems of equations represent networks of causal determination out in the world. For instance, suppose that whether the power is on is structurally determined by whether the light switch is up. If the light switch is up, then the power is on, and if the light switch is down, then the power is off.
\begin{equation} P := S \tag{2}\label{2} \end{equation} Combining the structural equations \eqref{1} and \eqref{2} gives us a system of structural equations \begin{aligned} L &:= P \wedge S \\\
P &:= S
\end{aligned}

What makes this system of equations structural is that we are interpreting them causally. The equations don’t just say that there is a certain relationship between the values of $L, P,$ and $S$. They additionally say that the value of $L$ is causally determined by the values of $P$ and $S$; and that the value of $P$ is causally determined by the value of $S$. It is for this reason that I use the asymmetric relation “$:=$”, and not the symmetric relation “$=$”. For instance, it follows from the system of equations consisting of \eqref{1} and \eqref{2} that $S=1$ iff $L=1$; so the equation $$ S = L $$ will be true if the system of structural equations $($\eqref{1}, \eqref{2}$)$ is. However, it will be false that $$ S := L $$ For, even though the value of $S$ must match the value of $L$, it is not the case that the value of $S$ is causally determined by the value of $L$. It is this additional information which is conveyed by the symbol $:=$.

In a structural equation, there is exactly one dependent variable on the left-hand-side of the equation, and at least one independent variable on the right-hand-side. I’ll use “$\mathbf{PA}(V)$” to represent a vector of the independent variables on the right-hand-side of $V$'s structural equation. (It is common to refer to these variables as $V$'s causal parents.) Then, a structural equation is of the form $$ V := \phi_V(\mathbf{PA}(V)) $$ where $\phi_V$ is some function from the values of the variables in $\mathbf{PA}(V)$ to the values of $V$. I will insist, by the way, that $\phi_V$ be surjective if we are to interpret it causally. I will use “$\phi_V$” to represent the entire structural equation $V := \phi_V(\mathbf{PA}(V))$. If a variable appears on the left-hand-side of a structural equation, then that variable is endogenous. Otherwise, it is exogenous.

What I will call a causal model, $\mathbb{M}$, consists of a vector of exogenous variables $\mathbb{U}$, a vector of endogenous variables $\mathbb{V}$, a vector of structural equations $\mathbb{E}$, and a context, $\vec{u}$, which is an assignment of values to the exogenous variables in $\mathbb{U}$.


Causal Model A causal model $\mathbb{M}$ is a 4-tuple $$ \mathbb{M} = \langle \mathbb{U}, \mathbb{V}, \mathbb{E}, \vec{u} \rangle $$ of

  1. A (non-empty) vector $\mathbb{U}$ of exogenous variables, $( U_1, U_2, \dots, U_M )$.
  2. A (non-empty) vector $\mathbb{V}$ of endogenous variables, $ ( V_1, V_2, \dots, V_N)$.
  3. A vector $\mathbb{E}$ of structural equations, $ ( \phi_1, \phi_2, \dots, \phi_N) $, one for each endogenous variable $V_i \in \mathbb{V}$.
  4. A context $\vec{u} = ( u_1, u_2, \dots, u_M )$, which assigns a value to each exogenous variable $U_i \in \mathbb{U}$.

(This is a slightly non-standard presentation. Normally, the context is not taken to be a part of the causal model.)
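For concreteness, here is how the 4-tuple looks as a small Python data structure (a sketch; the class name and the choice to represent each $\phi_V$ as a function from a dict of parent values are mine):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class CausalModel:
    exogenous: Tuple[str, ...]         # U
    endogenous: Tuple[str, ...]        # V
    equations: Dict[str, Callable]     # E: one phi_V for each endogenous V
    context: Dict[str, int]            # u: a value for each exogenous U

# The switch/power/light model, in the context S = 1:
M = CausalModel(
    exogenous=("S",),
    endogenous=("P", "L"),
    equations={"P": lambda v: v["S"], "L": lambda v: v["P"] & v["S"]},
    context={"S": 1},
)
```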

Given a causal model, we may generate a causal graph by creating a node for every variable and placing an arrow (a directed edge) between two variables $U$ and $V$, with its tail at $U$ and its head at $V$, $U \to V$, iff $U$ appears on the right-hand-side of $V$'s structural equation. For instance, the causal model of the light, the power, and the switch, determines this causal graph:

For a more careful and thorough introduction to causal models, and a theory of when they are correct—that is, when they correctly represent relations of causal determination out in the world—see section 2 of this paper.

When Removing Exogenous Variables Preserves Correctness

Suppose that we have the causal model introduced above, with the context $S=1$ (the switch is actually up). It appears that we can excise the exogenous variable $S$ from this model entirely. We may simply take $S$'s actual value $1$ and plug it into all the structural equations in which the variable $S$ appeared. When we do this, the structural equation associated with $P$ no longer depends upon any variables, and simply says that $P := 1$. That is: the effect of removing the exogenous variable $S$ has been to render $P$ exogenous. And, when we remove the exogenous $S$, the structural equation associated with $L$ becomes $L := P \wedge 1$, or just $L := P$.

We therefore get the causal model $\mathbb{M}$ with the exogenous variable $S$ excised. This is $$ \mathbb{M}_{S} = \langle (P), (L), (L := P), (1) \rangle. $$ That is: $\mathbb{M}_S$ consists of the vector of exogenous variables $\mathbb{U} = (P)$, the vector of endogenous variables $\mathbb{V}=(L)$, the vector of structural equations $\mathbb{E} = (L := P)$, and the exogenous assignment $\vec{u} = (1)$ to $P$.

I think that, if the original model $\mathbb{M}$ was correct, then so too is $\mathbb{M}_S$. This follows from a counterfactual understanding of what makes a causal model correct, since the counterfactuals entailed by the new model $\mathbb{M}_S$ are a proper subset of the counterfactuals entailed by the old model $\mathbb{M}$. Given some plausible assumptions, it also follows from my own preferred way of understanding what makes a causal model correct.

No causal model represents all of the features of reality which could potentially make a difference with respect to the values of the variables in the model. In every causal model, we will be taking for granted certain features of, or causal precursors to, the system being modeled. If I want to model the causal determinants of the forest fire, I needn’t explicitly include a variable for the presence of oxygen. So long as there is plenty of oxygen in the atmosphere, it may be true that whether there is a fire is causally determined by whether the lightning struck. Similarly, so long as the light switch is actually up, whether the light is illuminated is causally determined by whether the power is on or off.

In general, if $U$ is an exogenous variable in the causal model $\mathbb{M}$, we can define the $U$-reduction of $\mathbb{M}$ to be what you get when you remove $U$ from $\mathbb{U}$, put into $\mathbb{U}$ any variables in the model which were causally determined by $U$ alone (and remove those variables from $\mathbb{V}$), replace $U$ with its value in the context $\vec{u}$ within every equation in $\mathbb{E}$ (except, of course, for those endogenous variables $V$ which were causally determined by $U$ alone), and update the context $\vec{u}$ appropriately.


Exogenous $U$-Reduction. Given a causal model $\mathbb{M} = \langle \mathbb{U, V, E}, \vec{u} \rangle$, and some $U \in \mathbb{U}$, the $U$-reduction of $\mathbb{M}$, $\mathbb{M}_U$, is $\langle \mathbb{U}_U, \mathbb{V}_U, \mathbb{E}_U, \vec{u}_U \rangle$, where

  1. $\mathbb{U}_U$ is the vector of previously exogenous variables, minus $U$, and plus any endogenous variables whose values were determined by $U$ alone.
  2. $\mathbb{V}_U$ is the vector of previously endogenous variables, minus any whose values were determined by $U$ alone.
  3. $\mathbb{E}_U$ is a vector of structural equations. For each endogenous variable $V$ in $\mathbb{V}_U$, there is exactly one structural equation, which is the result of taking $V$'s old structural equation in $\mathbb{E}$, and replacing the variable $U$ wherever it appears (if at all) with $U$'s value in the context $\vec{u}$.
  4. $\vec{u}_U$ is an assignment of values to the variables in $\mathbb{U}_U$ which matches $\vec{u}$ for all exogenous variables previously in $\mathbb{U}$; for those newly exogenous variables, $V$, the assignment in $\vec{u}_U$ is the one determined by taking $V$'s old structural equation in $\mathbb{E}$ and replacing the variable $U$ with $U$'s value in the old context $\vec{u}$.
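Here is a sketch of this operation in Python (to detect which variables are determined by $U$ alone, each equation carries an explicit list of its parents; the representation and function name are mine):

```python
def u_reduce(U, exogenous, endogenous, equations, context):
    """Exogenous U-reduction: substitute U's contextual value into every
    equation, and exogenize any variable that was determined by U alone.
    Equations are (function, parent-list) pairs."""
    u_val = context[U]
    new_exo = [x for x in exogenous if x != U]
    new_ctx = {x: context[x] for x in new_exo}
    new_endo, new_eqs = [], {}
    for V in endogenous:
        phi, parents = equations[V]
        if parents == [U]:                 # determined by U alone: exogenize V
            new_exo.append(V)
            new_ctx[V] = phi({U: u_val})
        else:
            new_parents = [p for p in parents if p != U]
            new_eqs[V] = (lambda vals, phi=phi, u=u_val: phi({**vals, U: u}),
                          new_parents)
            new_endo.append(V)
    return new_exo, new_endo, new_eqs, new_ctx

# The S-reduction of the switch/power/light model, in the context S = 1:
eqs = {"P": (lambda v: v["S"], ["S"]),
       "L": (lambda v: v["P"] & v["S"], ["P", "S"])}
exo, endo, new_eqs, ctx = u_reduce("S", ["S"], ["P", "L"], eqs, {"S": 1})
print(exo, endo, ctx)  # ['P'] ['L'] {'P': 1}: the model M_S = <(P), (L), (L := P), (1)>
# Validity (discussed below) additionally requires that `endo` be non-empty
# and that each reduced equation remain surjective.
```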

However, while we can safely remove the exogenous variable $S$ from $\mathbb{M}$ in the context $S=1$, we cannot remove $S$ in the context $S=0$. If we try to do so, we will end up with the structural equation $L := P \wedge 0$. But this equation tells us that $L$'s value does not depend upon $P$'s value. No matter what value $P$ takes on, $L$ will take on the value $0$. So the resulting model would say, falsely, that $P$ does not causally determine $L$.

The right way to think about this, I believe, is that some $U$-reductions will lead to models which violate necessary conditions on the correctness of causal models. In particular, in order for a structural equation $\phi_V$ to be correct, every value of the left-hand-side variable $V$ must be in the image of $\phi_V$. That is to say: only surjective functions may appear in correct structural equations. And, in order for a causal model to be correct, all of the structural equations it contains must be correct. So, in the context $S=0$, removing the exogenous variable $S$ renders the structural equation $L := P \wedge 0$ non-surjective. Such $U$-reductions are not valid.
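For finite variables, surjectivity is easy to check by brute force; a sketch:

```python
from itertools import product

def surjective(phi, parent_ranges, child_range):
    """Does every value in the child's range get hit by some assignment
    of values to the parents?"""
    image = {phi(*vals) for vals in product(*parent_ranges)}
    return set(child_range) <= image

# L := P AND S is surjective onto {0, 1}; L := P AND 0 is not.
print(surjective(lambda p, s: p & s, [(0, 1), (0, 1)], (0, 1)))  # True
print(surjective(lambda p, s: p & 0, [(0, 1), (0, 1)], (0, 1)))  # False
```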

Similarly, in order for a causal model to be correct, the vector of endogenous variables $\mathbb{V}$ must be non-empty. Some $U$-reductions will violate this necessary condition on correctness. For instance, consider the $S$-reduced model discussed above. If we try to $U$-reduce this model by excising the exogenous variable $P$, the resulting model, $\mathbb{M}_{S, P}$, will have no endogenous variables. $U$-reductions like these are not valid, either.

In general, we may say that a $U$-reduction is valid iff (1) the resulting endogenous variable set is non-empty, and (2) the resulting structural equations are all surjective.

If a $U$-reduction is valid, then the $U$-reduced model is correct if the original model was. Valid $U$-reduction preserves correctness.


Valid Exogenous Reduction Preserves Correctness. If $\mathbb{M}$ is a correct causal model, and $\mathbb{M}_U$ is a valid exogenous $U$-reduction of $\mathbb{M}$ (i.e., a $U$-reduction of $\mathbb{M}$ whose endogenous variable set is non-empty and whose structural equations are all surjective), then $\mathbb{M}_U$ is a correct causal model, too.


I have previously laid down conditions for the correctness of causal models. Valid $U$-Reduction Preserves Correctness is not intended as a conjecture about those correctness conditions. I know that my account, as it stands now, violates this principle (the curious may consider the $H$-reduction of the causal model in figure 8 of that paper). Valid $U$-Reduction Preserves Correctness is intended to supplement that account. The principle allows you to move from the correctness of one causal model to the correctness of a certain sub-model, even if the sub-model was not previously deemed correct on its own.

When Removing Endogenous Variables Preserves Correctness

Go back to our original causal model of the light switch, the power, and the light, \begin{aligned} L &:= P \wedge S \\\
P &:= S
\end{aligned} Just as it appeared that we could excise the exogenous variable $S$ from this model, so too does it appear that we may excise the variable $P$ from this model. Since we know that the power turns on whenever the light switch is on; and since we know that, if both the power and the switch are on, the light will be illuminated, it appears that we may conclude straightaway that, if the switch is on, then the light will be illuminated. Moreover, the switch’s being on appears to causally determine the light’s being illuminated. So it seems that, if the original causal model was correct, then so too should be the model $$ \mathbb{M}_P = \langle (S), (L), (L := S), (1) \rangle $$ This is the model containing the sole exogenous variable $S$, the sole endogenous variable $L$, the sole structural equation $L := S$, and the exogenous assignment $1$ to $S$. Call this model the endogenous $P$-reduction of $\mathbb{M}$. We got $\mathbb{M}_P$ from $\mathbb{M}$ by simply replacing the variable $P$ in $L$'s structural equation with the right-hand-side of $P$'s structural equation, giving $L := S \wedge S$. And this function is equivalent to $L := S$.

What’s more, it appears as though we can carry out this endogenous reduction of $\mathbb{M}$ whatever the value of $S$ happens to be. Even if $S = 0$, it will still be the case that $L$'s value will be causally determined to match $S$'s value.

In general, if $V$ is an endogenous variable in the causal model $\mathbb{M}$, we can define the $V$-reduction of $\mathbb{M}$ to be what you get when you remove $V$ from $\mathbb{V}$, and replace $V$, every time it appears on the right-hand-side of a structural equation, with the right-hand-side of $V$'s own structural equation.


Endogenous $V$-Reduction. Given a causal model $\mathbb{M} = \langle \mathbb{U, V, E}, \vec{u} \rangle$, and some $V \in \mathbb{V}$, the $V$-reduction of $\mathbb{M}$, $\mathbb{M}_V$, is $\langle \mathbb{U}, \mathbb{V}_V, \mathbb{E}_V, \vec{u} \rangle$, where

  1. $\mathbb{V}_V$ is the original vector of endogenous variables $\mathbb{V}$, minus the variable $V$.
  2. $\mathbb{E}_V$ is just like the original vector of structural equations, except that it is lacking $V$'s structural equation $V := \phi_V( \mathbf{PA}(V) )$, and every occurrence of $V$ on the right-hand-side of the remaining equations is replaced with $\phi_V( \mathbf{PA}(V) )$.
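A sketch of this substitution, in the same (function, parent-list) representation used for exogenous reduction above:

```python
def v_reduce(V, endogenous, equations):
    """Endogenous V-reduction: drop V's equation, and inline V's right-hand
    side wherever V occurs as a parent."""
    phi_V, parents_V = equations[V]
    new_eqs = {}
    for W in endogenous:
        if W == V:
            continue
        phi, parents = equations[W]
        if V in parents:
            merged = [p for p in parents if p != V]
            merged += [p for p in parents_V if p not in merged]
            def new_phi(vals, phi=phi, phi_V=phi_V):
                return phi({**vals, V: phi_V(vals)})
            new_eqs[W] = (new_phi, merged)
        else:
            new_eqs[W] = (phi, parents)
    return [w for w in endogenous if w != V], new_eqs

# The P-reduction of the switch/power/light model: L := S AND S, i.e. L := S.
eqs = {"P": (lambda v: v["S"], ["S"]),
       "L": (lambda v: v["P"] & v["S"], ["P", "S"])}
endo, new_eqs = v_reduce("P", ["P", "L"], eqs)
phi_L, parents_L = new_eqs["L"]
print(endo, parents_L, phi_L({"S": 1}), phi_L({"S": 0}))  # ['L'] ['S'] 1 0
```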

While we can safely remove the endogenous variable $P$ in our model of the light and the switch, we may not always do this. While some endogenous $V$-reductions are valid, others are not. For instance, consider the Lewisian neuron diagram shown below.

The neuron diagram displays a case of what’s known in the literature as early preemption. Neuron $A$'s firing would have caused $E$ to fire, but it was preempted by neuron $C$'s firing. As things actually shook out, it was $C$, and not $A$, that caused $E$ to fire. I’ll suppose that this neuron diagram may be represented with a causal model containing a binary variable for every neuron, where those variables take the value $1$ if the neuron fires at its designated time, and take the value $0$ if the neuron does not fire at its designated time. Then, we will end up with the following system of structural equations.

$$\begin{aligned} E &:= B \vee D \\\
B &:= A \wedge \neg C \\\
D &:= C \end{aligned}$$

The endogenous $D$-reduction of this causal model is

$$\begin{aligned} E &:= B \vee C \\\
B &:= A \wedge \neg C \\\
\end{aligned}$$

And the endogenous $B$-reduction of that reduced model is

$$ E := (A \wedge \neg C) \vee C $$

Or, equivalently,

$$ E := A \vee C $$
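The equivalence here is just distribution, together with the fact that $\neg C \vee C$ is a tautology: $$ (A \wedge \neg C) \vee C \;=\; (A \vee C) \wedge (\neg C \vee C) \;=\; A \vee C $$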

But this model treats $A$ and $C$ symmetrically. And both $A$ and $C$ take on the value $1$. This means that any theory of singular causation which looks only at the patterns of counterfactual dependence in a causal model (including, perhaps, information about which variable values are default and which are deviant) will, when applied to this model, say that $A=1$ caused $E=1$ iff $C=1$ caused $E=1$. But this would be a disastrous result—for $A=1$ did not cause $E=1$; while $C=1$ did cause $E=1$.

Lesson: if we want to use correct causal models to uncover relations of singular causation, then we had better not think that endogenous reduction always preserves correctness.

A similar lesson follows when we look at cases of preemptive prevention like the one shown below.

Here, $B$'s firing prevents $E$ from firing. However, had $B$ not fired, $A$ would have prevented $E$ from firing. So, $B$'s firing preempted $A$'s prevention. We can represent this neuron diagram with the following system of equations (where the variables are given the natural interpretation, and take on the value $1$ if the associated neuron fires, and take on the value $0$ if the associated neuron does not fire).

$$\begin{aligned} E &:= C \wedge \neg (B \vee D ) \\\
D &:= A \wedge \neg B \end{aligned}$$

The endogenous $D$-reduction of this causal model gives the sole structural equation $$ E := C \wedge \neg (B \vee (A \wedge \neg B)) $$ Or, equivalently, $$ E := C \wedge \neg A \wedge \neg B $$ However, this reduced model treats $A$ and $B$ symmetrically, and both $A$ and $B$ take on the value $1$; any theory which looks only at patterns of counterfactual dependence in correct causal models will therefore say that $A=1$ prevented $E$ from firing iff $B=1$ prevented $E$ from firing. But $B=1$ prevented $E$ from firing while $A=1$ did not. So, again, if we want to use correct causal models to uncover relations of singular causation, then we had better not think that endogenous reduction always preserves correctness.
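(For the record, the simplification above is De Morgan’s law followed by distribution: $$ \neg\big(B \vee (A \wedge \neg B)\big) \;=\; \neg B \wedge \neg(A \wedge \neg B) \;=\; \neg B \wedge (\neg A \vee B) \;=\; \neg A \wedge \neg B $$ since the disjunct $\neg B \wedge B$ vanishes.)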

I’d like to suggest that precisely the same thing goes wrong in both of the foregoing cases of endogenous variable reduction. In the first case—the case of preemption—the endogenous $B$-reduction took us to a model in which $E$'s value is determined directly by both $A$ and $C$. In the associated causal graph of the $B$-reduced model, there is one arrow leading from $A$ to $E$, and another arrow leading from $C$ to $E$. The model presents these causal pathways as autonomous, with both $A$ and $C$ determining $E$'s value in a way that is independent of the other’s influence. However, $A$'s determination of $E$'s value is not autonomous of $C$'s. In fact, both $A$ and $C$ determine $E$'s value by way of a common variable, $B$.

Similarly, in the case of preemptive prevention, endogenous $D$-reduction brought us to a model in which $E$'s value is determined directly and autonomously by both $A$ and $B$. But the way that $A$ determines $E$'s value is not autonomous of the way that $B$ determines $E$'s value. In fact, both $A$ and $B$ determine $E$'s value by way of a common variable, $D$.

In the original causal models, variables like $B$ (in Preemption) and $D$ (in Preemptive Prevention) are called colliders. What makes a variable in a causal model a collider is that there are two distinct arrows leading into that variable. Equivalently, a variable is a collider iff it has more than one causal parent. (Note: “collider” is usually defined to be a path-relative notion; as I’m using the notion here, a variable is a collider iff it is a collider along some path or other.)

Reflection on cases like the foregoing leads to the following constraint on valid endogenous $V$-reduction: if the endogenous variable $V$ is a collider, then $V$-reduction is not valid. Colliders may not be removed in the manner specified in Endogenous $V$-Reduction.

I believe this to be the only constraint on valid endogenous reduction: so long as $V$ is not a collider, $V$ may be excised from the causal graph in the manner specified in Endogenous $V$-Reduction. So we may say that, in general, a $V$-reduction is valid iff $V$ is a non-collider.
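In the (function, parent-list) representation from the sketches above, the collider test is one line:

```python
def colliders(equations):
    """A variable is a collider iff it has more than one causal parent."""
    return {V for V, (phi, parents) in equations.items() if len(parents) > 1}

def valid_v_reduction(V, equations):
    return V not in colliders(equations)

# Early preemption: E := B or D, B := A and not C, D := C.
eqs = {"E": (lambda v: v["B"] | v["D"], ["B", "D"]),
       "B": (lambda v: v["A"] & (1 - v["C"]), ["A", "C"]),
       "D": (lambda v: v["C"], ["C"])}
print(valid_v_reduction("D", eqs))  # True: D has a single parent
print(valid_v_reduction("B", eqs))  # False: B is a collider, so B may not be excised
```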

If a $V$-reduction is valid, then the $V$-reduced model is correct if the original model was. Valid $V$-reduction preserves correctness.


Valid Endogenous Reduction Preserves Correctness. If $\mathbb{M}$ is a correct causal model, and $\mathbb{M}_V$ is a valid $V$-reduction of $\mathbb{M}$ (i.e., if $V$ is not a collider in $\mathbb{M}$), then $\mathbb{M}_V$ is a correct causal model, too.


In the case of exogenous $U$-reduction, the corresponding principle (Valid Exogenous Reduction Preserves Correctness) carried with it a genuine extension of the conditions for correctness of causal models which I endorsed previously. In the case of endogenous $V$-reduction, reflection on cases like preemption and preemptive prevention calls for a corresponding constriction in those conditions. There are causal models, like the $D$-reduced model of figure 3, which my previous account deems correct but which are not correct. Valid Endogenous Reduction Preserves Correctness does not yet rule out models like those. To do so, we should additionally endorse:


Invalid Endogenous Reduction Destroys Correctness. If $\mathbb{M} = \langle \mathbb{U, V, E}, \vec{u} \rangle$ is a correct causal model with a collider $V \in \mathbb{V}$, then the $V$-reduction of $\mathbb{M}$, $\mathbb{M}_V$, is not correct.



2017, Sep 23

When do variables overlap?

I spent the past two days preparing comments on a very interesting paper by Vera Hoffmann-Kolss for the upcoming Society for the Metaphysics of Science meeting. Thinking through the paper got me freshly confused about some matters that I had thought settled, and so I thought I’d write up a blog post on those confusions in an attempt to sort them out.

It’s tempting to think that counterfactual dependence suffices for causation. But this can’t be quite right. I both played cards and played poker. Had I not played cards, I wouldn’t have played poker. So there is counterfactual dependence between my playing poker and my playing cards. But my playing cards didn’t cause me to play poker. The relationship between my playing cards and my playing poker is constitutive, not causal.

Sophisticated counterfactual theories of causation, therefore, do not say that counterfactual dependence suffices for causation. Rather, what they say is that counterfactual dependence between distinct events suffices for causation. By ‘distinct’, we mean a bit more than ‘non-identical’. The event of my playing poker is not identical to the event of my playing cards. (If you doubt this, note that they differ causally. I played poker because I didn’t have a pinochle deck—I usually play pinochle. But I certainly didn’t play cards because I didn’t have a pinochle deck.) Rather, ‘distinct’ in this context means something more like ‘not logically related’. If two events are not distinct, then let’s say that they overlap.

Worries about overlap plague other theories of causation, too. My playing poker is a minimally sufficient condition for my playing cards, so—unless overlapping conditions are specifically excluded—Mackie’s account of causation will deem them causally related.

Today, I’ll be exploring this problem as it plays out for those who, like myself, think that the causal relata are variable values. For such theorists, the problem is to say when variables are distinct, and when they overlap.

1. Lewisian Events and Variables

Let me begin by getting clearer about what I mean by a ‘variable’. When I’m at my most careful and pedantic, I like to think of variables as generalized Lewisian events. Lewis thought that an event was a property of a spacetime region. For Lewis, a property is just a class of individuals at worlds—intuitively, the class of individuals possessing the property at those worlds. Thus, for Lewis, an event is just a class of spacetime regions at worlds.

Given a Lewisian event, we may construct a function, $e$, from regions to $\{ 1, \ast \}$. $e ( R )=1$ if the event occurs within $R$, and $e ( R ) =\ast$ otherwise. Here, I follow Lewis in distinguishing regions within which an event occurs from regions in which an event occurs. An event occurs in at most one region in any world. However, it occurs within every region which contains the region in which it occurs. For every world, there will be a worldly region which contains all the regions at that world. If and only if an event occurs at a world $\omega$, the function $e$ will map the worldly region $R_\omega$ to $1$. Given the class of regions at $\omega$ _within_ which an event occurs, we may recover the region _in_ which it occurs at $\omega$ by simply taking their intersection. So, just as we may go from a Lewisian event to one of these functions, we may go from one of these functions back to a Lewisian event. A Lewisian event, then, is equivalent to a function from regions to $\{ 1, \ast \}$.


Lewisian Events

A Lewisian event, $e$, is a function from spacetime regions at worlds to $\{1, \ast \}$. $e ( R ) = 1$ iff $e$ occurs within the region $R$, and $e ( R ) = \ast$ otherwise.


This is a characterization, not a definition. Not just any function from regions to $\{1, \ast\}$ will count as a Lewisian event. For instance, no event occurs in more than one spacetime region at any one world. Lewis rules out events which occur in only one possible world—that is, events which map only a single worldly region to $1$. He also rules out events which are too gerrymandered—e.g., any event which is essentially “a fiddling in the presence of a boy whose grandson will first set foot on the moon” (p. 257). Some unified account of which events exist and which do not would be nice, but Lewis has none to offer.

We can understand a Lewisian variable as the contrastive generalization of a Lewisian event. A variable $V$ is a function from regions to $\mathbb{R} \cup \{ \ast \}$. If $V( R) = v$, then $v$ is the value the variable $V$ takes on within the region $R$. If $V( R) = \ast$, then the variable $V$ is undefined within $R$. As with events, we should distinguish those regions within which a variable takes on a value from those regions in which it takes on a value. Like events, variables take on a value in at most one region per world, but take on a value within every region containing the region in which it takes on a value. As with events, at each world, we may take the intersection of all regions within which a variable takes on a value to recover the region in which it takes on that value.


Lewisian Variables

A Lewisian variable, $V$, is a function from spacetime regions at worlds to $\mathbb{R} \cup \{ \ast \}$. If $V ( R ) = v$, then $v$ is the value the variable takes on within the region $R$, and if $V ( R ) = \ast$, then the variable is undefined within the region $R$.


This, too, is a characterization and not a definition. Not just any function from worldly regions to $\mathbb{R} \cup \{ \ast \}$ counts as a variable. As in the case of events, we should assume that variables take on a value in at most one region per world, and we should rule out certain too gerrymandered functions from regions to $\mathbb{R} \cup \{\ast\}$. There is no variable which takes on the value $x$ within exactly those regions where my right earlobe is $x$ meters from the last spot the last descendant of Napoleon ever left their glasses, and in the region where Stanley Kubrick first dreamt. As with events, it would be nice to have a precise characterization of which variables exist and which do not, but I have none to offer.

As a helpful shorthand, we may allow ourselves to write “$e$” for the set of regions which get mapped to $1$ by the event $e$. $$ e \,\,:=\,\, \{ R \mid e( R) = 1 \} $$ And we may allow ourselves to write “$V=v$” for the set of regions which get mapped to $v$ by the function $V$. $$ V=v \,\,:=\,\, \{ R \mid V( R) = v \} $$ This is a helpful, and mostly harmless, bit of notation, but for reasons I’ll discuss in the next paragraph, strictly speaking we should not conflate the set $\{ R \mid V( R) = v \}$ with the variable value $V=v$. Some other bits of notation: I’ll use “$\mathscr{R}(V)$” for the range of the variable $V$—that is, the set of all real numbers to which $V$ maps some region, $ \{ v \in \mathbb{R} \mid \exists R : V( R) = v \} $. I’ll use a boldfaced “$\mathbf{v}$” for a set of values of $V$, and I’ll therefore use “$V \in \mathbf{v}$” for the set of regions which get mapped to a value within $\mathbf{v}$, $\{ R \mid V(R ) \in \mathbf{v} \}$.

Notice that, given these characterizations, a Lewisian event is just a singly-valued Lewisian variable. A multiply-valued Lewisian variable is what we may call a proper Lewisian variable. I say that proper Lewisian variables are the contrastive generalization of Lewisian events. Why ‘contrastive’? Consider an event like Susan’s stealing the bicycle. This is a function which maps regions to $1$ iff they contain Susan stealing the bicycle, and $\ast$ otherwise. This event may be embedded in two different variables. Firstly, consider a variable we may call whether Susan steals. This variable takes on the value $1$ for regions in which Susan steals the bicycle, takes on the value $0$ for regions in which Susan buys the bicycle, and takes on the value $\ast$ otherwise (e.g., for regions which don’t contain Susan, or in which Susan steals something other than the bicycle). Secondly, consider a variable we may call what Susan steals. This variable takes on the value $1$ for regions in which Susan steals the bicycle, takes on the value $0$ for regions in which Susan steals the moped, and takes on the value $\ast$ otherwise. Both of these variables take on the value $1$ iff the event of Susan’s stealing the bicycle occurs. However, the variable value whether Susan steals $= 1$ is different from the variable value what Susan steals $=1$. The difference between them is akin to the difference between the sentences

  1. Susan stole the bicycle (rather than paying for it).
  2. Susan stole the bicycle (rather than the moped).

One way of making sense of sentences like (1) and (2) is that (1) presupposes that Susan either stole or paid for the bicycle, and asserts that she stole it; whereas (2) presupposes that Susan either stole the bicycle or the moped, and asserts that she stole the bike. Though (1) and (2) assert the same thing, they differ in their presuppositions. Similarly, though the variables whether Susan steals and what Susan steals take on the value $1$ in precisely the same regions, they differ with respect to their presuppositions. whether Susan steals presupposes that Susan either stole the bike or paid for it, while what Susan steals presupposes that Susan either steals the bike or the moped. This difference in presupposition makes for a difference in variable value. The variable value whether Susan steals $=1$ is a different variable value than what Susan steals $=1$. This is for the good, since whether Susan stole caused her arrest, but what she stole did not. (See Dretske (1977).) Since we want our theory of causation to mark this difference, and since it would be preferable to not have to increase the arity of the causal relation, it is good that our causal relata have this contrastive character.

It is for this reason that we should be careful to distinguish the set $\{ R \mid V( R) = v \}$ from the variable value $V=v$. If we did not distinguish them, then the variable value whether Susan steals $=1$ would be identical to the variable value what Susan steals $=1$. Compare: it is common to model propositions as functions from possible worlds to truth-value. If we assume that these functions are total, then there is no harm in shifting back and forth between the functions and the set of worlds which get mapped to ‘true’. Given that the functions are total, these representations are equivalent. One method for representing propositions with presuppositions in this framework is to make the corresponding functions partial. Worlds at which the presupposition fails are not mapped to any truth-value. Once this change is made, we must be careful to distinguish a proposition from the set of worlds at which it is true; we may go from the former to the latter, but not from the latter back to the former. And the situation is parallel when we move from taking events to be the causal relata to taking variable values to be the causal relata. A variable value presupposes a set of disjoint events, and singles out one of them as occurrent. Thus, from each variable value, we get a corresponding event; but we cannot get from an event back to a corresponding variable value. In what follows, I will use expressions like “$V=v$” to stand for classes of regions at worlds, but we should bear in mind that this is a simplification which is harmless for present purposes, but could quickly become harmful in others.

2. Lewisian Overlap

Now that we’re clear on what variables are (or at least, what I think they should be, for the purposes of constructing a theory of causation), let’s think through when we should say that variables overlap, and when we should say that they are distinct.

2.1. Overlapping Events

Since variables are just the contrastive generalization of events, a nice place to start is with Lewis’s theory of when events overlap, and when they are distinct. To begin with, let’s say that an event $e$ implies $f$ iff every region containing the event $e$ also contains the event $f$, or $e \subseteq f$. We can then present the Lewisian account of when events overlap as follows:


Overlapping Events

Two events, $e$ and $f$, overlap if:

E1) $e$ implies $f$, $$ e \subseteq f $$

E2) $f$ implies $e$, $$ f \subseteq e $$ or

E3) there is some event, $i$, which is implied by both $e$ and $f$, $$ e \subseteq i \quad \text{ and } \quad f \subseteq i $$


In (E3), we can think of the event $i$ as an event which lies at the intersection of $e$ and $f$. Here, the set theoretic notation can be misleading—keep in mind that, to say that $e \subseteq i$ is to say that any region which contains $e$ also contains $i$; and to say that $f \subseteq i$ is similarly to say that any region which contains $f$ also contains $i$. So $i$ is an event which sits (necessarily) at the intersection of the events $e$ and $f$. If there is such an intersective event, then $e$ and $f$ overlap.

(Parenthetically, because of the superficial differences between my presentation of Lewisian events and Lewis’s own, my use of the term “implies” differs from Lewis’s. Nevertheless, Overlapping Events follows from the sufficient conditions for overlap which Lewis offers in section 5 of Events. At the end of this post, I offer a proof of this fact.)

Lewis introduces condition (E3) because of cases like the following (originally from Kim): I write out the name “Larry” on the whiteboard. In so doing, I write out the letters “Larr”, and I write the letters “rry”. Had I not written the letters “Larr”, I would not have written the letters “rry”. But this dependence is logical, and not causal. Neither event on its own implies the other, so conditions (E1) and (E2) on their own will not tell us that these events overlap. However, (E3) will do the job, since there is the event of writing the letters “rr”. Any region within which I write “Larr” is a region within which I write “rr”; and any region within which I write “rry” is a region within which I write “rr”. So (E3) allows us to correctly rule that my writing “Larr” overlaps with my writing “rry”.

Actually, once we have condition (E3) of Overlapping Events, we no longer have any need for conditions (E1) or (E2). That’s because (E1) is just the special case of (E3) where $i=f$, and (E2) is just the special case of (E3) where $i=e$.

2.2 Overlapping Variables

Generalizing these conditions to variables, we may give the following sufficient conditions for variables $U$ and $V$ overlapping.


Overlapping Variables Two variables, $U$ and $V$, overlap if:

V1) some value of $U$ implies something non-trivial about the value of $V$, $$ \exists u \in \mathscr{R}(U) \,\,\, \exists \mathbf{v} \subsetneq \mathscr{R}(V) \quad U=u \subseteq V \in \mathbf{v}$$

V2) some value of $V$ implies something non-trivial about the value of $U$, $$ \exists v \in \mathscr{R}(V) \,\,\, \exists \mathbf{u} \subsetneq \mathscr{R}(U) \quad V=v \subseteq U \in \mathbf{u}$$ or

V3) there is some variable, $I$, about whose values both some value of $U$ and some value of $V$ imply something non-trivial, $$ \exists u \in \mathscr{R}(U) \,\,\, \exists \mathbf{i} \subsetneq \mathscr{R}(I) \quad U=u \subseteq I \in \mathbf{i} $$ and $$ \exists v \in \mathscr{R}(V) \,\,\, \exists \mathbf{i} \subsetneq \mathscr{R}(I) \quad V=v \subseteq I \in \mathbf{i} $$


This isn’t the most obvious generalization of Overlapping Events. In place of (V1), we might instead have said, “some value of $U$ implies some value of $V$”. This condition would have been strictly weaker, in the sense that it would have classified strictly fewer pairs of variables as overlapping. For illustration, suppose that both $U$ and $V$ are ternary variables which, for any spacetime region $R$, either take on the value $\ast$ or one of the following pairs of values.

$$ \begin{array}{l | c c c c c c} V & 1 & 2 & 2 & 0 & 0 & 1 \\\
U & 0 & 0 & 1 & 1 & 2 & 2 \end{array} $$

In this case, no value of $U$ implies any value of $V$ (nor does any value of $V$ imply any value of $U$). Nevertheless, every value of $U$ does imply something non-trivial about the value of $V$. Necessarily, if $U(R ) = u$, then $V(R ) \neq u$, for $u \in \{ 0, 1, 2 \}$. Symmetrically, every value of $V$ implies something non-trivial about the value of $U$. Necessarily, if $V( R)=v$, then $U( R) \neq v$, for $v \in \{ 0, 1, 2 \}$. There is clearly a logical relationship between $U$ and $V$, though the weaker formulation “some value of $U$ implies some value of $V$” wouldn’t allow us to detect it, and so I think it makes sense to opt for my stronger formulation of Overlapping Variables.

Notice that, since events are just singly-valued variables, (V1), (V2), and (V3) also give sufficient conditions for the overlap of events. In this special case, they reduce back to the Lewisian conditions (E1), (E2), and (E3).

We saw above that, once we have condition (E3) of Overlapping Events, we get conditions (E1) and (E2) for free. The same is true of condition (V3) of Overlapping Variables. For (V1) is just the special case of (V3) in which $I = V$, and (V2) is just the special case of (V3) in which $I = U$. Moreover, condition (V2) is redundant, once we have condition (V1). If some value of $V$ implies something non-trivial about the value of $U$, $V=v \subseteq U \in \mathbf{u}$, then there must be some value of $U$, $u^* \notin \mathbf{u}$, such that $U = u^* $ implies that $V \neq v$. So there must be some value of $U$ which implies something non-trivial about the value of $V$.

3. Woodwardian Overlap

Jim Woodward puts forward a necessary condition for variable distinctness (and therefore, a sufficient condition for variable overlap) called independent fixability. Two variables $U$ and $V$ are independently fixable iff, for every value $u \in \mathscr{R}(U)$ and every value $v \in \mathscr{R}(V)$, it is possible to set $U$ to $u$ via an intervention while setting $V$ to $v$ via an intervention. I’d prefer to not invoke Woodward’s technical notion of an intervention if I don’t have to; and fortunately, it follows from $U$ and $V$ being independently fixable that it is possible that $U = u$ and $V = v$, for every pair of values $u$ and $v$. Thus, Woodward’s independent fixability entails the following sufficient condition for variable overlap:


Incompossible Values The variables $U$ and $V$ overlap if

IV) there is some value $u \in \mathscr{R}(U)$ and some value $v \in \mathscr{R}(V)$ such that there is no possible world within which $U=u$ and $V=v$.


Now, it’s interesting to note that (IV) is equivalent to condition (V1) from Variable Overlap. (I prove this at the end of the post.) If we think that Incompossible Values is strong enough to reveal all cases of variable overlap, then, we should think that condition (V1) is all that’s required, and that condition (V3) is too strong.

This is what I thought until recently. I learned better from Hoffmann-Kolss’s paper, mentioned at the beginning of this post. Condition (V1) on its own is too weak. There are pairs of variables, all of whose values are compossible with one another, but which still overlap. Here’s a modification of Hoffmann-Kolss’s case: I will roll a standard six-sided die. Then, consider the variables $O$ and $H$, where $$ O( R) = \left\{\begin{array}{l l} 1 & \text{ if the die lands on an odd number within $R$} \\\
0 & \text{ if the die lands on an even number within $R$} \\\
\ast & \text{ otherwise }
\end{array} \right. $$ and $$ H( R) = \left\{\begin{array}{l l} 1 & \text{ if the die lands on a high number ($>3$) within $R$} \\\
0 & \text{ if the die lands on a low number ($\leqslant 3$) within $R$} \\\
\ast & \text{ otherwise }
\end{array} \right. $$ No value of $O$ implies anything non-trivial about the value of $H$; nor does any value of $H$ imply anything non-trivial about the value of $O$. So (V1), and (IV), rule $O$ and $H$ distinct. However, there is a probabilistic correlation between the values of $O$ and $H$. While the unconditional probability that $O = 1$ is 1/2, the probability that $O=1$, given that $H=1$, is 1/3. And this correlation is not causal, but rather logical. So our account of variable overlap should tell us that $O$ and $H$ overlap.
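
Since the numbers matter here, a quick enumeration over the six equiprobable outcomes confirms the correlation:

```python
from fractions import Fraction

outcomes = range(1, 7)          # the six equiprobable faces of the die

def prob(event):
    return Fraction(sum(1 for n in outcomes if event(n)), 6)

odd  = lambda n: n % 2 == 1     # the outcomes on which O = 1
high = lambda n: n > 3          # the outcomes on which H = 1

print(prob(odd))                                        # 1/2
print(prob(lambda n: odd(n) and high(n)) / prob(high))  # 1/3
```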

4. Shared Supervenience Bases

An incredibly natural reaction to this case is to think that the reason $O$ and $H$ overlap is that there is the more fine-grained variable $N$, which tells us the exact number the die lands on. That is, $N( R) = n$ if $R$ is a region within which the die lands on $n$, for $n \in \{ 1, 2, \dots, 6 \}$, and $N( R) = \ast$ otherwise. The values of $O$ and $H$ supervene upon the value of $N$, in the sense that $N$'s values imply the values of $O$ and $H$. Any region within which $N$ takes on an odd value is a region within which $O$ takes on the value $1$; and any region within which $N$ takes on an even value is a region within which $O$ takes on the value $0$. Similarly, any region within which $N$ takes on a value greater than 3 is a region within which $H$ takes on a value of $1$; and any region within which $N$ takes on a value less than or equal to 3 is a region within which $H$ takes on a value of $0$.

For the reasons we encountered above, we will want to generalize this notion of variable supervenience so that it is enough for one variable to supervene upon another that the value of one implies something non-trivial about the value of the other. Then, we might think that the right way to catch overlapping variables like $O$ and $H$ is by saying: if there is a variable, $S$, with values that imply something non-trivial about the value of $U$, and with values that imply something non-trivial about the value of $V$, then $U$ and $V$ overlap. Let’s call this sufficient condition for overlap Shared Supervenience Base.


Shared Supervenience Base The variables $U$ and $V$ overlap if

SSB) there is some variable $S$ with a value which implies something non-trivial about the value of $U$, $$ \exists s \in \mathscr{R}(S) \,\,\, \exists \mathbf{u} \subsetneq \mathscr{R}(U) \quad S = s \subseteq U \in \mathbf{u} $$ and a value which implies something non-trivial about the value of $V$, $$ \exists s \in \mathscr{R}(S) \,\,\, \exists \mathbf{v} \subsetneq \mathscr{R}(V) \quad S = s \subseteq V \in \mathbf{v} $$


(Shared Supervenience Base, by the way, is essentially the route which Hoffmann-Kolss ends up taking, though there are some superficial differences.)

Notice that (SSB) is strictly stronger than (IV). That is, any variables which (IV) rules overlapping will be ruled overlapping by (SSB); though (SSB) rules some variables overlapping which (IV) does not, like $O$ and $H$. To see that (SSB) will agree with (IV) when it says two variables overlap, recall that (IV) is equivalent to (V1), and then note that (V1) is the special case of (SSB) in which $S = U$.

Notice also that (SSB) is not just condition (V3) from Overlapping Variables. If some value of $U$ implies something non-trivial about the value of $V$, then let’s say that $U$ implies $V$. Then, (SSB) rules that two variables, $U$ and $V$, overlap when there is some third variable which implies both $U$ and $V$. (V3), on the other hand, rules that two variables, $U$ and $V$, overlap when there is some third variable which is implied by both $U$ and $V$.

Notice also that (V3) is capable of correctly classifying $O$ and $H$ as overlapping. For both $O$ and $H$ imply something non-trivial about the value of the variable $N$. For instance, $O = 1 \subseteq N \in \{ 1, 3, 5\}$, and $H=1 \subseteq N \in \{4, 5, 6 \}$.

Since events are just singly-valued variables, (SSB) also gives a sufficient condition for the overlap of events. In this special case, (SSB) says that the events $e$ and $f$ overlap if there is some third event, $s$, which implies both $e$ and $f$, $s \subseteq e$ and $s \subseteq f$.

I believe that we should reject (SSB), and that, instead, we should endorse the Lewisian (V3). To see why, we can just focus on what (SSB) says about events.

Let’s suppose that the assassination of Archduke Ferdinand by Gavrilo Princip is an event. And let’s suppose that it is essentially an assassination, with a gun, of Archduke Ferdinand, and by Gavrilo Princip. That is, no region gets mapped to $1$ by this event unless it is a region within which Gavrilo Princip shoots Archduke Ferdinand dead. Call the event “$a$”, for assassination. It seems important to have $a$ included in our ontology—this event caused the start of World War I, and in order for it to do this, it must exist.

Let’s suppose also that Gavrilo Princip’s pulling the trigger is an event, and that this event is essentially a pulling of a trigger by Gavrilo Princip. That is, no region gets mapped to $1$ by this event unless it is a region within which Gavrilo Princip pulls the trigger of a gun. Call this event “$p$”, for Princip.

And let’s additionally suppose that Archduke Ferdinand’s death is an event, and that this event is essentially a dying of Archduke Ferdinand. That is, no region gets mapped to $1$ by this event unless it is a region within which Archduke Ferdinand dies. Call this event “$f$”, for Ferdinand.

It seems important to have the events $p$ and $f$ in our ontology, since it seems evident that $p$ caused $f$. For this reason, it is also important that $p$ and $f$ be distinct. If we say that they overlap, then our account of causation would incorrectly tell us that $p$ did not cause $f$.

But note that $a$ implies both $p$ and $f$. Any region within which $a$ occurs is a region within which $p$ occurs. And any region within which $a$ occurs is also a region within which $f$ occurs. So (SSB) tells us, incorrectly, that $p$ and $f$ overlap. Therefore, if we accept (SSB), then we could not say that Gavrilo Princip’s pulling the trigger caused the death of Archduke Ferdinand. We’d better not accept (SSB), then.

Notice that the same verdict does not follow from (V3) of the Lewisian Overlapping Variables. For $p$ does not imply $a$; nor does $f$ imply $a$. Princip could pull the trigger without assassinating Archduke Ferdinand. So too could the Archduke die without being assassinated by Princip. So I think there is compelling reason to reject (SSB) and to instead endorse (V3). (V3) allows us to correctly rule that $O$ and $H$ overlap without incorrectly classifying $p$ and $f$ as overlapping.


A. Loose Ends

A.1. Lewis is Committed to Overlapping Events

Above, I claimed that Overlapping Events was entailed by Lewis’s sufficient conditions for overlap. Given the variant notation, this is far from obvious. For the curious (and to assuage my own nagging conscience) I’ll give a proof here. Let’s introduce $\hat{e}$ for a function which maps a region $R$ to $1$ iff the event occurs in that region (the event’s merely occurring within that region is not enough). As before, we can use $\hat{e}$ for the set of regions which get mapped to $1$ by $\hat{e}$. Then $\hat{e}$ will be an event as Lewis formally defined them.

Lewis gave sufficient conditions for overlap in terms of a parthood relation. He said that $\hat{e}$ and $\hat{f}$ overlap if either (1) $\hat{e}$ is a part of $\hat{f}$; (2) $\hat{f}$ is a part of $\hat{e}$; or (3) there is some event $\hat{\imath}$ which is a part of both $\hat{e}$ and $\hat{f}$. What I wish to show here is that $e \subseteq f$ suffices for $\hat{f}$ being a part of $\hat{e}$. This will show that overlap according to Overlapping Events suffices for overlap according to Lewis.

On page 255 of Events, Lewis defines his implication relation as follows,

Let us say that event $e$ implies event $f$ iff, necessarily, if $e$ occurs in a region then also $f$ occurs in that region. Considered as classes, event $e$ is a subclass included in class $f$.

This use of ‘implies’ differs from the one I used above. It is like the one I used above, but applied to events after we put the hats on. To say that $\hat{e}$ implies $\hat{f}$ is to say that $\hat{e} \subseteq \hat{f}$. Later, on page 258, Lewis defines the relation of being essentially part of as follows (with minor notational changes)

Let us say that event $f$ is essentially part of event $e$, iff, necessarily, if $e$ occurs in a region, then also $f$ occurs in a subregion included in that region.

If we use $\hat{f} \sqsubseteq \hat{e}$ to stand for this relation, then we have that $$ \hat{f} \sqsubseteq \hat{e} := (\forall R) (\hat{e}( R) = 1 \rightarrow (\exists R') ( R' \subseteq R \wedge \hat{f}( R') = 1) ) $$ We may now prove Lemma 1: if $e \subseteq f$, then $\hat{f} \sqsubseteq \hat{e}$. For suppose that $e \subseteq f$ and that $\hat{e}( R) = 1$. Then $e$ occurs in $R$, and so $e$ occurs within $R$. Since $e \subseteq f$, $f$ occurs within $R$ as well, which is to say that there is some subregion $R' \subseteq R$ in which $f$ occurs, so that $\hat{f}( R') = 1$.

Then, on page 259, Lewis defines the relation of being a part of as follows (with minor notational changes)

Let us say that occurrent event $f$ is part of occurrent event $e$ iff some occurrent event that implies $f$ is essentially part of some occurrent event that implies $e$.

If we use “$\hat{f}P\hat{e}$” to stand for “$\hat{f}$ is a part of $\hat{e}$”, then this tells us that $$ \hat{f}P\hat{e} \,\,:=\,\, (\exists \hat{\imath}) (\exists \hat{\jmath}) ( \hat{\imath} \subseteq \hat{f} \wedge \hat{\jmath} \subseteq \hat{e} \wedge \hat{\imath} \sqsubseteq \hat{\jmath} ) $$ We can then prove Lemma 2: if $\hat{f} \sqsubseteq \hat{e}$, then $\hat{f}P\hat{e}$. For every event implies itself, so we may simply take $\hat{\imath}$ to be $\hat{f}$ and $\hat{\jmath}$ to be $\hat{e}$.

Putting together Lemmas 1 and 2, we have that $e \subseteq f$ suffices for $\hat{f}$ being a part of $\hat{e}$. Therefore, a) $e \subseteq f$ suffices for $\hat{f}$ being a part of $\hat{e}$; b) $f \subseteq e$ suffices for $\hat{e}$ being a part of $\hat{f}$; and c) there being an event $i$ such that $e \subseteq i$ and $f \subseteq i$ suffices for there being an event $\hat{\imath}$ such that $\hat{\imath}$ is a part of $\hat{e}$ and $\hat{\imath}$ is a part of $\hat{f}$. So, if two events are overlapping according to Overlapping Events, then they will be overlapping according to Lewis.

A.2. (IV) is Equivalent to (V1)

Above, I claimed that Woodward’s (IV) from Incompossible Values is equivalent to (V1) from Overlapping Variables. To see this, suppose that (V1) is true, so that $U = u \subseteq V \in \mathbf{v}$, for some $u \in \mathscr{R}(U)$ and some $\mathbf{v} \subsetneq \mathscr{R}(V)$. It follows that there is some value of $V$, $v^* \notin \mathbf{v}$, such that there is no region within which $U = u$ and $V = v^* $. Therefore, there is no worldly region within which $U=u$ and $V = v^* $, and thus $U$ and $V$ overlap according to (IV).

Going in the other direction, suppose that $U$ and $V$ overlap according to (IV). Then, there is some value of $U$—call it ‘$u^* $’—and some value of $V$—call it ‘$v^* $’—such that there is no worldly region within which $U=u^* $ and $V = v^* $. But then, $U=u^* $ implies something non-trivial about the value of $V$, namely, that $V \neq v^* $. So $U$ and $V$ overlap according to (V1).


2017, Jun 25

The Impossibility of a Paretian Liberal

A foundational assumption in welfare economics is that, if everyone prefers A to B, then A is better than B. This normative claim is called the ‘weak Pareto principle’. At a first glance, this principle can appear unimpeachable. At the very least, it appears to be a sensible principle for policy makers to adopt.

Many of us are also committed to the principle that there are some choices that should be left up to the individual. Even if everyone else prefers that my nose be pierced, if I prefer it unpierced, then it is better for my nose to remain unpierced. A state-of-affairs in which I am compelled to pierce my nose, against my wishes, is worse than a state-of-affairs in which I get to make up my own mind about whether my nose is pierced. And that’s so no matter how many other people would prefer to see me with a pierced nose. Call this commitment ‘minimal liberalism’.

In 1970, Amartya Sen produced an amazing result which seems to show that minimal liberalism and the weak Pareto principle are inconsistent with one another. Today I want to rehearse Sen’s result, and introduce an objection to Sen’s way of formalizing ‘minimal liberalism’ first made by Allan Gibbard. I think that Gibbard’s objection teaches us that Sen formulated liberalism incorrectly. However, I’ll conclude by showing that a better formulation of liberalism, one that avoids Gibbard’s objection, is also inconsistent with the weak Pareto principle.

The Impossibility of a Paretian Liberal, take 1

Sen’s result relies upon a formal framework in which we think of social goodness as determined by the preferences of the individuals living in a society. So we suppose that all the members of our society have their own preference ordering over states-of-affairs. Individual $i$ has the preference ordering $\succeq_i$. That’s a weak preference ordering; I’ll assume that we can get a strong preference ordering $\succ_i$ out of a weak one through the definition $A \succ_i B := A \succeq_i B \wedge B \not\succeq_i A$.

Given the preferences of every individual, $\succeq_i$, a social welfare function $W$ delivers a group preference ordering, $\succeq_G$, which tells us which states-of-affairs the group prefers to which other state-of-affairs.

$$ W:\,\, \left [ \begin{array}{c} \succeq_1 \\\
\succeq_2 \\\
\vdots \\\
\succeq_N \\\
\end{array} \right ] \,\,\to \,\,\,\,\, \succeq_G $$

An implicit normative assumption is that we may treat the group preference ordering as a betterness ordering. That is: if $A \succ_G B$, then $A$ is better than $B$. So we can think of the subscripted ‘$G$’ as standing either for ‘group’ or for ‘goodness’. (I’ll also assume throughout, by the way, that the social welfare function $W$ will always be defined, no matter which collection of individual preference orderings we hand it.)

Sen then interprets the weak Pareto principle and minimal liberalism in terms of this social welfare function. In these terms, the weak Pareto principle says that, if everyone prefers $A$ to $B$, then the group must prefer $A$ to $B$. And minimal liberalism says that every person is decisive with respect to some choice. That is: every person has at least one pair of options such that, whatever that person’s preferences are between those two options, that becomes the group’s preference. If I prefer my nose being pierced to my nose not being pierced, then that’s what the group prefers, too. And if I prefer my nose not being pierced to my nose being pierced, then that’s what the group prefers.


Weak Pareto Principle If $A \succ_i B$ for all $i$, then $A \succ_G B$.

Minimal Liberalism For all $i$, there is at least one pair of alternatives, $A$ and $B$, such that, if $A \succ_i B$, then $A \succ_G B$ and, if $B \succ_i A$, then $B \succ_G A$.


Sen additionally assumes that, if we’re going to interpret $\succeq_G$ as a social betterness ordering, then it had better not land us in cycles. That is, it had better not tell us that $A$ is better than $B$, $B$ is better than $C$, and $C$ is better than $A$.


No Cycles

There is no sequence of states-of-affairs $A_1, A_2, \dots, A_N$ such that $$ A_1 \succ_G A_2 \succ_G \dots \succ_G A_N \succ_G A_1 $$


Sen then proves that there is no social welfare function which satisfies Weak Pareto Principle, Minimal Liberalism, and No Cycles. I won’t go through the formal proof here; but I’ll go through a nice illustrative example from Sen (cultural references have been updated).

Prude is outraged and offended by Fifty Shades of Grey. Lewd, on the other hand, is delighted by the book. Prude would most prefer that nobody read the filth. However, if somebody must read it, Prude would rather read it himself than expose a libertine like Lewd to its influence. Lewd would most prefer that both he and Prude read the book. However, if only one of them is to read it, Lewd would rather it be Prude—he relishes the thought of Prude’s horrified reactions.

Thus, Prude and Lewd’s preference orderings are given in the following table.

$$ \begin{array}{r | l l} & \text{Prude} & \text{Lewd} \\\hline
\text{1st.} & \text{Neither reads } (N) & \text{Both read } (B) \\\
\text{2nd.} & \text{Prude reads } (P) & \text{Prude reads } (P) \\\
\text{3rd.} & \text{Lewd reads } (L) & \text{Lewd reads } (L) \\\
\text{4th.} & \text{Both read } (B) & \text{Neither reads } (N)
\end{array} $$

For the purposes of illustration, suppose that Prude and Lewd are the only people in this society. And let’s suppose that whether you read a book is a matter which ought to be left up to the individual. If you prefer to read, then it’s better that you read; if you prefer to not read, then it’s better that you refrain. That’s all we need to bring out the conflict between Weak Pareto Principle, Minimal Liberalism, and No Cycles.

The only difference between $P$ and $N$ is whether Prude reads. It should be entirely up to Prude whether he reads or not. Since he prefers to not read, Minimal Liberalism tells us that it is better if he doesn’t. So $$ N \succ_G P $$ Note that both Prude and Lewd prefer $P$ to $L$. So, by the Weak Pareto Principle, $$ P \succ_G L $$ And, the only difference between $L$ and $N$ is whether Lewd reads. It should be entirely up to Lewd whether he reads or not. Since he prefers to read, Minimal Liberalism tells us that it is better if he does. So $$ L \succ_G N $$ But now we’ve violated No Cycles.
$$ N \succ_G P \succ_G L \succ_G N $$
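
If you like, you can check this mechanically. Here is a quick Python sketch (the rankings are just the ones from the table above) which generates the group preferences licensed by the two principles and then searches for a cycle:

```python
from itertools import permutations

# Rankings from the table above, most- to least-preferred.
prude = ["N", "P", "L", "B"]
lewd  = ["B", "P", "L", "N"]

def prefers(ranking, a, b):
    return ranking.index(a) < ranking.index(b)

edges = set()

# Weak Pareto Principle: unanimous preferences become group preferences.
for a, b in permutations(prude, 2):
    if prefers(prude, a, b) and prefers(lewd, a, b):
        edges.add((a, b))

# Minimal Liberalism: each man is decisive over the pair of states
# differing only in what he himself does.
for agent, (a, b) in [(prude, ("N", "P")), (lewd, ("L", "N"))]:
    edges.add((a, b) if prefers(agent, a, b) else (b, a))

# Depth-first search for a cycle in the group preference graph.
def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    def visit(node, path):
        if node in path:
            return True
        return any(visit(x, path | {node}) for x in graph.get(node, []))
    return any(visit(node, set()) for node in graph)

print(sorted(edges))     # [('L', 'N'), ('N', 'P'), ('P', 'L')]
print(has_cycle(edges))  # True
```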

So it seems that we face a choice: either reject the Weak Pareto Principle, or reject Minimal Liberalism. You can’t be both a Paretian and a liberal. If you want to be a liberal, you’d better reject the Weak Pareto principle.

The Impossibility of a Liberal?

Actually, matters are worse. Even rejecting the Weak Pareto Principle won’t get you out of trouble. Allan Gibbard showed that Minimal Liberalism leads to cycles all by itself, even without the Weak Pareto Principle.

Consider Match and Clash. Clash is a non-conformist. She would prefer having a pierced nose, but what’s most important to her is that her fashion be different from Match’s. So she wants to pierce her nose if (but only if) Match doesn’t pierce hers. Match is a follower. She doesn’t want to pierce her nose, but she does want her fashion to match Clash’s. So she wants to pierce her nose if (and only if) Clash pierces hers. Therefore, Clash and Match’s preferences are given by the following table.

$$ \begin{array}{r | l l} & \text{Clash} & \text{Match} \\\hline
\text{1st.} & \text{Clash pierces } (C) & \text{Neither pierce } (N) \\\
\text{2nd.} & \text{Match pierces } (M) & \text{Both pierce } (B) \\\
\text{3rd.} & \text{Both pierce } (B) & \text{Clash pierces } (C) \\\
\text{4th.} & \text{Neither pierce } (N) & \text{Match pierces } (M)
\end{array} $$

For the purposes of illustration, let’s suppose that Match and Clash are the only people in this society. And let’s suppose that whether your nose is pierced is the kind of thing which ought to be left up to the individual. If you prefer a pierced nose, then it’s better if your nose is pierced. And if you don’t, then it’s better if it’s not pierced.

Note that the only difference between $C$ and $N$ is whether Clash pierces. And Clash prefers $C$ to $N$. So the liberal says that $C$ is better than $N$. $$ C \succ_G N $$ The only difference between $N$ and $M$ is whether Match pierces. And Match prefers $N$ to $M$. So the liberal says that $N$ is better than $M$. $$ N \succ_G M $$ The only difference between $M$ and $B$ is whether Clash pierces. And Clash prefers $M$ to $B$. So the liberal says that $M$ is better than $B$. $$ M \succ_G B $$ And, finally, the only difference between $B$ and $C$ is whether Match pierces. Since Match prefers $B$ to $C$, the liberal says that $B$ is better than $C$. $$ B \succ_G C $$ But now we’ve contradicted No Cycles. $$ C \succ_G N \succ_G M \succ_G B \succ_G C $$ So Minimal Liberalism is inconsistent with No Cycles all by itself. We didn’t have to bring up the Weak Pareto Principle at all.
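
The same mechanical check as before, with just the rankings and decisive pairs swapped out, turns up the four-step cycle:

```python
# Rankings from the table above, most- to least-preferred.
clash = ["C", "M", "B", "N"]
match = ["N", "B", "C", "M"]

# Each pair of states differing only in one woman's nose is settled
# by her preference between them.
decisive = [("C", "N", clash), ("N", "M", match),
            ("M", "B", clash), ("B", "C", match)]

edges = set()
for a, b, agent in decisive:
    edges.add((a, b) if agent.index(a) < agent.index(b) else (b, a))

print(sorted(edges))
# [('B', 'C'), ('C', 'N'), ('M', 'B'), ('N', 'M')], i.e. the cycle
# C > N > M > B > C
```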

The Impossibility of a Paretian Liberal, take 2

Sen’s principle Minimal Liberalism assumes that the right way to think about liberalism is in terms of the decisiveness of individual preference. Sen’s liberal thinks that, if Prude prefers to not read, then it’s better if Prude not read (all else equal). And, if Lewd prefers to read, then it’s better if Lewd read (again, all else equal). What Gibbard’s case shows us, I think, is that this way of understanding liberalism is misguided.

And, in retrospect, we should recognize that we ought to have rejected Minimal Liberalism as a characterization of liberalism on independent grounds. Liberals think that certain self-regarding decisions should be left up to the individual. One of Mill’s arguments for this was that the individual is in a better position to know what’s best for them than the rest of society. However, Mill didn’t think that individuals were necessarily right about what’s in their own best interest.

Liberals should acknowledge that people can and do make self-regarding choices that make them worse off. A libertine liberal like Lewd will grant that what Prude reads should be left up to him. But that won’t keep Lewd from thinking that Prude’s diet of religious drivel is making his life worse. And surely Lewd should also think that, all else equal, it’s better for people’s lives to go better, and worse for them to go worse.

So the liberal shouldn’t think that, when it comes to self-regarding decisions, people’s actual preferences are objectively best. What they should think is that, when it comes to self-regarding decisions, it is better to allow people to choose for themselves than for the state to choose for them. That is: the liberal ought to think that, when it comes to self-regarding decisions, it is worse to deprive people of liberty than it is to allow them to make their own poor choices.

(At least, this is what a consequentialist liberal ought to think—though a non-consequentialist liberal is free to admit that things would be better if people were compelled to make the right choices, though such an arrangement would be unjust in spite of its betterness. I’m very sympathetic to this kind of position, but I’ll put it aside for the nonce.)

Our earlier discussion did not draw any distinction between possibilities in which people were forced to take certain options and those in which they freely chose those options. Let’s introduce this distinction, and use it to formulate our principle of liberalism.


Liberalism as Non-Compulsion For any states-of-affairs $A$ and $B$, if everyone makes all of their self-regarding choices freely in $A$, while someone is compelled in some self-regarding choice in $B$, then $A \succ_G B$.

Liberalism as Non-Compulsion does not fall prey to Gibbard-style objections. It is easy to see that the principle on its own could never give rise to cycles of betterness. The set of all possible states-of-affairs is partitioned into those in which all self-regarding choices are free and those in which some self-regarding choice is not free. And all Liberalism as Non-Compulsion says is that everything in the former set is better than everything in the latter set. On its own, this won’t lead to a cycle.

However, conjoined with the Weak Pareto Principle, this new principle of liberalism does run into cycles, in precisely the same way as before.

Let’s return to Sen’s example of Prude and Lewd. First, we’ll distinguish between those possibilities in which all choices are free and those possibilities in which some choices are compelled. We’ll subscript each of the previous states-of-affairs with an ‘$F$’ if it is a state-of-affairs in which all choices are free and with a ‘$C$’ if it is a state-of-affairs in which people are compelled against their will to act in a certain way.

  • $N_F$: Both Prude and Lewd freely choose to not read.
  • $N_C$: Both Prude and Lewd are forced to not read.
  • $L_F$: Lewd freely chooses to read and Prude freely chooses to not read.
  • $L_C$: Lewd is forced to read, and Prude is forced to not read.
  • $P_F$: Prude freely chooses to read and Lewd freely chooses to not read.
  • $P_C$: Prude is forced to read and Lewd is forced to not read.
  • $B_F$: Both Prude and Lewd freely choose to read.
  • $B_C$: Both Prude and Lewd are forced to read.

Now suppose that, while Prude values freedom—so that, all else equal, he would rather have Lewd and himself choose freely than be compelled—he values it less than he does keeping the filth of Fifty Shades from spreading. And, while Lewd values freedom—so that, all else equal, he would rather have Prude and himself choose freely than be compelled—he values it less than he does Prude’s disgust, and his own delight, at the book’s depravity. So, if we ignore the subscripts, their preferences are the same as before, and otherwise, each of them prefers outcomes where choices are free.

$$ \begin{array}{r | c c} & \text{Prude} & \text{Lewd} \\\hline
\text{1st.} & N_F & B_F \\\
\text{2nd.} & N_C & B_C \\\
\text{3rd.} & P_F & P_F \\\
\text{4th.} & P_C & P_C \\\
\text{5th.} & L_F & L_F \\\
\text{6th.} & L_C & L_C \\\
\text{7th.} & B_F & N_F \\\
\text{8th.} & B_C & N_C
\end{array} $$

Now, by Liberalism as Non-Compulsion, the outcome where Lewd freely chooses to read and Prude freely refrains is better than the outcome where Prude is forced to read and Lewd to refrain. $$ L_F \succ_G P_C
$$ But both Prude and Lewd prefer $P_C$ to $L_F$. So, by the Weak Pareto Principle, $$ P_C \succ_G L_F $$ And this contradicts No Cycles. $$ L_F \succ_G P_C \succ_G L_F $$

In sum: Gibbard showed us that we ought to reformulate Sen’s Minimal Liberalism. However, the best reformulation doesn’t get us out of the conflict with the Weak Pareto Principle. So the conflict is genuine. If you are a liberal, you cannot be a Paretian. If you are a liberal, you should deny that people’s preferences determine goodness in the way that Pareto imagined. If you are a Paretian, then you cannot be a liberal. If you are a Paretian, you should deny that it’s always best for self-regarding decisions to be left to the individual.


2017, May 31

Local and Global experts

Contemporary epistemology is replete with principles of expert deference. Epistemologists have claimed that you should treat the chances, your future selves, your rational self, and your epistemic peers as experts. What this means is that you should try to align your credences with theirs.

There are lots of ways you might try to align your credences with those of some expert function. (That expert function could be the chances, or it could be your future credences, or something else altogether. The particular function won’t matter, so I’ll just call the expert function, whatever it is, ‘$\mathscr{E}$’.) My focus here will be on just two ways of aligning your credences with $\mathscr{E}$'s: 1) by treating it as a local expert; and 2) by treating it as a global expert.


Local Expert

You treat $\mathscr{E}$ as a local expert iff, for all propositions $a$, and all numbers $n \in [0, 1]$, $$ C(a \mid \langle \mathscr{E}(a) = n \rangle) = n, \,\, \text{if defined} $$

Global Expert

You treat $\mathscr{E}$ as a global expert iff, for all propositions $a$, and all potential credence functions $E$, $$ C(a \mid \langle \mathscr{E} = E \rangle) = E(a), \,\, \text{ if defined} $$


In these definitions, $C$ is your own credence function. You should read ‘$\mathscr{E}$’ as a definite description, along the lines of ‘the credence function of the expert’. This definite description may refer to different credence functions at different worlds. And I am using the brackets ‘$\langle \,\, \rangle$’ to denote propositions. Thus, ‘$\langle \mathscr{E}(a) = n \rangle$’ is the proposition that the expert’s credence that $a$ is $n$. It is true at those worlds where $\mathscr{E}$’s credence in the proposition $a$ is $n$. And $\langle \mathscr{E} = E \rangle$ is the proposition that $E$ is the expert’s entire credence function, true at those worlds $w$ such that $\mathscr{E}_w = E$ (‘$\mathscr{E}_w$’ is $\mathscr{E}$’s credence function at world $w$).

It’s not immediately obvious what the relationship is between these two different ways of treating a function as an expert. You might think that they are equivalent, in the sense that you will treat $\mathscr{E}$ as a local expert if and only if you treat them as a global expert. In fact, they are not equivalent. Treating $\mathscr{E}$ as a global expert entails treating $\mathscr{E}$ as a local expert, but the converse is not true. (Throughout, by the way, I’m assuming probabilism and I’m assuming that your credences are defined over a finite number of worlds).


Proposition 1

If you treat $\mathscr{E}$ as a global expert, then you treat them as a local expert as well. However, you may treat $\mathscr{E}$ as a local expert without treating them as a global expert.


Proof. Note that $\{ \langle \mathscr{E} = E \rangle \mid E(a) = n \}$ is a partition of $\langle \mathscr{E}(a) = n \rangle$. If you treat $\mathscr{E}$ as a global expert, then for each $E$ such that $E(a) = n$, $C(a \mid \langle \mathscr{E} = E \rangle) = n$. It then follows from conglomerability (which follows from the probability axioms when the number of worlds is finite) that $C(a \mid \langle \mathscr{E}(a) = n \rangle) = n$.

To see that you may treat $\mathscr{E}$ as a local expert without treating them as a global expert, suppose that there are three possible worlds, $w_1$, $w_2$, and $w_3$, and that the expert’s credence function at each of those worlds is as shown below (the example originates from Gaifman’s 1988 article “A Theory of Higher-Order Probabilities”). (In the matrix, by the way, the $i$th row gives $\mathscr{E}$'s credence distribution over $w_1, w_2$ and $w_3$ at the world $w_i$.)

Figure 1

And suppose that your own credence distribution over $w_1, w_2,$ and $w_3$ is such that $C(\{w_i\}) =$ 1/3, for $i = 1, 2, 3$. Then, for every proposition $a$ and every number $n$, $C(a \mid \langle \mathscr{E}(a) = n \rangle) = n$. For instance, if $a = \{ w_1, w_2 \}$ and $n = 0.5$, then

$$ \begin{align} C(\{ w_1, w_2 \} \mid \langle \mathscr{E}(\{ w_1, w_2 \}) = 0.5 \rangle) &= C(\{ w_1, w_2 \} \mid \{ w \mid \mathscr{E}_w(\{ w_1, w_2 \})=0.5 \} ) \\\
&= C(\{ w_1, w_2 \} \mid \{ w_2, w_3 \}) \\\
&= 0.5 \end{align} $$

And the same is true for every other choice of $a$ and $n$, as you may check for yourself. Nevertheless, it is impossible to treat $\mathscr{E}$ as a global expert, since, so long as $C$ is a probability function,

$$ C(\{ w_1 \} \mid \langle \mathscr{E} = \mathscr{E}_{w_1} \rangle) =C(\{ w_1 \} \mid \{ w_1 \}) = 1 $$

But $\mathscr{E}_{w_1}(\{ w_1 \}) = 0.5 \neq 1$. QED.
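
Since the figure may not reproduce well here, here is a Python sketch of the whole check. The matrix is my reconstruction from the constraints in the text (each world gives credence 1/2 to itself and 1/2 to the next world in the cycle $w_1 \to w_2 \to w_3 \to w_1$); the sketch verifies local deference for every proposition and value, and exhibits the failure of global deference:

```python
from fractions import Fraction
from itertools import chain, combinations

half = Fraction(1, 2)
worlds = (1, 2, 3)

# Reconstructed from figure 1 (an assumption, not a quotation): row w
# gives E_w's distribution over w1, w2, w3.
E = {1: {1: half, 2: half, 3: 0},
     2: {1: 0, 2: half, 3: half},
     3: {1: half, 2: 0, 3: half}}

C = {w: Fraction(1, 3) for w in worlds}   # your uniform credences

def E_prob(w, a):                         # E's credence in a, at world w
    return sum(E[w][x] for x in a)

def C_cond(a, b):                         # C(a | b), None if undefined
    pb = sum(C[w] for w in b)
    return sum(C[w] for w in b if w in a) / pb if pb else None

# Local deference: C(a | <E(a) = n>) = n, whenever defined.
for a in chain.from_iterable(combinations(worlds, k) for k in (1, 2, 3)):
    for n in {E_prob(w, a) for w in worlds}:
        b = [w for w in worlds if E_prob(w, a) == n]
        assert C_cond(a, b) == n

# Global deference fails: <E = E_w1> = {w1}, so C({w1} | {w1}) = 1,
# yet E_w1({w1}) = 1/2.
print(C_cond((1,), (1,)), E[1][1])        # prints: 1 1/2
```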

So a principle of local deference is strictly weaker than a principle of global deference. Or, perhaps a better way of thinking about things: there are strictly more functions which can be treated as local experts than there are functions which can be treated as global experts.

This is a prima facie exciting observation, since a common objection to principles of global deference is that it is possible to treat $\mathscr{E}$ as a global expert if and only if $\mathscr{E}$ is certain of what their own credences are (because the focus is usually on certain ideal credence functions, certainty about your own credences is generally called immodesty). That is to say, it is possible to treat $\mathscr{E}$ as a global expert if and only if they are immodest—if and only if, for every world $w$, $\mathscr{E}_w(\langle \mathscr{E} = \mathscr{E}_w \rangle) = 1$. For suppose that $\mathscr{E}$ were modest—that is, suppose that, for some world $w$, $\mathscr{E}_w(\langle \mathscr{E} = \mathscr{E}_w \rangle) \neq 1$. And suppose that you treat $\mathscr{E}$ as a global expert. Then, substituting $\langle \mathscr{E} = \mathscr{E}_w \rangle$ in for $a$ and $\mathscr{E}_w$ in for $E$ in the definition of Global Expert, we have

$$ C(\langle \mathscr{E} = \mathscr{E}_w \rangle \mid \langle \mathscr{E} = \mathscr{E}_w \rangle ) = \mathscr{E}_w(\langle \mathscr{E} = \mathscr{E}_w \rangle ) \neq 1 $$

But the probability axioms require $C(a \mid a)$ to be 1 (or undefined) for all $a$.

So: if you think functions which aren’t certain of their own values should nevertheless be treated as experts, then you will think that we need a characterization of “treating a function as an expert” which goes beyond Global Expert. A common suggestion is to treat $\mathscr{E}$ as a modest expert.


Modest Expert

You treat $\mathscr{E}$ as a modest expert if and only if, for all propositions $a$ and all potential credence functions $E$, $$ C(a \mid \langle \mathscr{E} = E \rangle) = E(a \mid \langle \mathscr{E} = E \rangle) $$


But perhaps the move to such principles is too hasty. Perhaps we can get by just with principles of local deference. For note that the expert shown in figure 1 is modest; yet they can be treated as a local expert. So there are at least some modest functions which can be treated as local experts. And perhaps these are all the modest experts we need.

For this reason, the relationship between local and global experts is dialectically important to some debates in epistemology. For instance, Christensen endorses the claim that you should treat your currently rational self as a local expert. Elga criticizes this position on the grounds that it requires certainty that you are rational—however, in order to argue for this conclusion, he must first re-present Christensen’s principle as the claim that you should treat your rational self as a global expert (note: Elga recognizes that the second principle is stronger than the first). Perhaps, in the face of these criticisms, Christensen should hold tight to his original principle; perhaps it affords all the modesty we need.

No such luck, I’m afraid. Although there are some functions which can be treated as local experts but not global experts, these functions are incredibly singular. In fact, there is a good sense in which the function shown in figure 1 is the only kind of function which can be treated as a local, but not global, expert.

Given a function $\mathscr{E}$, from worlds to probability distributions over those same worlds, we can generate a Kripke frame $\langle \mathscr{W}, R \rangle$ from $\mathscr{E}$ as follows: $\mathscr{E}_w(\{ x \}) \neq 0$ if and only if $w$ bears the relation $R$ to $x$ (or, as I shall say, if and only if $w$ sees $x$).

Let’s say that a Kripke frame $\langle \mathscr{W}, R \rangle$ is cyclic iff

  1. Every world $w \in \mathscr{W}$ sees itself and exactly one other world.
  2. Every world $w \in \mathscr{W}$ is seen by exactly one distinct world.
  3. There are no two worlds $w, x$ such that $w$ sees $x$ and $x$ sees $w$.

A sample cyclic frame is shown below.

Figure 2

Note that the function from figure 1 will generate a cyclic frame in which, for each $w \in \mathscr{W}$, $\mathscr{E}_w(\{ w \}) =$ 1/2. Let’s call any function like this a uniform cyclic function (‘uniform’ because at every world $\mathscr{E}$ gives equal probability to its actual world and the one other possible world it sees).


Uniform Cyclicity

A function $\mathscr{E}$ is uniform cyclic if and only if $\mathscr{E}$ generates a cyclic frame and, for every $w \in \mathscr{W}$, $\mathscr{E}_w(\{ w \}) =$ 1/2.
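
For concreteness, here is a small sketch (in the same dictionary representation as the earlier snippet) of the frame-generation recipe and the uniform-cyclicity test; the three-world function from before passes it:

```python
from fractions import Fraction

half = Fraction(1, 2)

# The three-world function from before: each world sees itself and the next.
E = {1: {1: half, 2: half, 3: 0},
     2: {1: 0, 2: half, 3: half},
     3: {1: half, 2: 0, 3: half}}

def frame(E):
    # w bears R to x iff E_w gives {x} nonzero credence.
    return {w: {x for x, p in E[w].items() if p != 0} for w in E}

def is_uniform_cyclic(E):
    R = frame(E)
    for w, seen in R.items():
        if w not in seen or len(seen) != 2:   # condition 1 fails
            return False
        other = (seen - {w}).pop()
        if w in R[other]:                     # condition 3 fails
            return False
        if E[w][w] != half:                   # uniformity fails
            return False
    # Condition 2: every world is seen by exactly one distinct world.
    return all(sum(1 for w in R if x in R[w] and w != x) == 1 for x in R)

print(is_uniform_cyclic(E))   # True
```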


Now, it turns out that the functions which may be treated as local experts, but which may not be treated as global experts, are precisely the uniform cyclic ones. If a function is uniform cyclic, then you may treat it as a local expert, but not as a global expert. And if a function $\mathscr{E}$ is not uniform cyclic, then you can treat $\mathscr{E}$ as a local expert if and only if you can treat it as a global expert.


Proposition 2

It is possible to treat a function $\mathscr{E}$ as a local expert but not possible to treat them as a global expert when and only when $\mathscr{E}$ is uniform cyclic.

The only credences which treat such a function as a local expert are those which are uniform over the worlds in each cycle.


The proof of this proposition is quite long and tedious, so I’m putting it in a separate document here.

What Proposition 2 means, I think, is that we don’t have to fret about the difference between the local and global formulations of various principles of expert deference. For what the proposition tells us is that nobody should endorse a principle of local deference without thereby endorsing a principle of global deference. To endorse a principle of local deference without endorsing a principle of global deference is to say that uniformly cyclic functions are deserving of epistemic deference, but no other modest function is. This strikes me as entirely unmotivated.

If we think that you should treat the probability function which generates the cyclic frame in figure 2 as an expert, then we should also think that you should treat the probability function which generates the frame shown in figure 3 as an expert.

Figure 3

After all, the only difference between the frame in figure 2 and the frame in figure 3 is that we have taken the single possibility $w_1$ in figure 2 and divided it into two sub-possibilities $w_1$ and $w_1'$ in figure 3. We could suppose that, at all worlds in figure 3, $\mathscr{E}$ gives the proposition $\{ w_1, w_1' \}$ precisely the same probability it gave the singleton proposition $\{ w_1 \}$ in figure 2. If that’s so, then say that $\mathscr{E}$ reduces to uniform cyclicity. After all, if we just collapse the possibilities $w_1$ and $w_1'$, then we get back a uniform cyclic function. The difference between a uniform cyclic function and a function which merely reduces to uniform cyclicity ought not make any difference with respect to whether some supposed expert is deserving of epistemic deference, nor how that deference ought to be shown. However, Proposition 2 assures us that such minor changes in representation do make a difference with respect to whether we can treat the function as a local expert. So, if we’re in for treating some function as a local expert, we shouldn’t demur from treating them as a global expert as well.

So I think that Christensen, e.g., effectively has committed himself to the view that your rational self must be immodest. While his claim that you should treat your rational self as a local expert does not on its own entail this conclusion, it follows with the rather weak assumption that, if a uniform cyclic function is deserving of epistemic deference, then so too is a function which merely reduces to uniform cyclicity. Unless Christensen believes that 1) our rational selves could be uniform cyclic, but 2) they could not merely reduce to uniform cyclicity, he should also think that you should treat your rational self as a global expert. And this entails that your rational self is immodest.

TL;DR: you might have thought that principles of local deference are equivalent to principles of global deference. They’re not. Principles of local deference are weaker than principles of global deference. But they’re really not much weaker—just slightly. And there’s really no good reason to treat any function as a local expert but not a global expert. So, while they’re ever-so-slightly different, really, you shouldn’t ever worry about the differences.

2017, May 3

The Brier Measure is not strictly proper (as epistemologists have come to use that term)

In recent years, formal epistemologists have gotten interested in measures of the accuracy of a credence function. One famous measure of accuracy is the one suggested by Glenn Brier. Given a (finite) set $\Omega = \{ \omega_1, \omega_2, \dots, \omega_N \}$ of possible states of the world, the Brier measure of the accuracy of a credence function $c$ at the state $\omega_i$ is

$$ \mathfrak{B}(c, \omega_i) = - (1-c(\{ \omega_i \}))^2 - \sum_{j \neq i} c(\{ \omega_j \})^2 $$

And formal epistemologists usually say that a measure of accuracy $\mathfrak{A}$ is strictly proper iff every probability function expects itself (and only itself) to have the highest $\mathfrak{A}$-value.


Strict Propriety A measure of accuracy $\mathfrak{A}$ is strictly proper iff, for every probability function $p$ and every credence function $c \neq p$, the $p$-expectation of $p$'s $\frak{A}$-accuracy is strictly greater than the $p$-expectation of $c$'s $\frak{A}$-accuracy. That is: for every probability $p$ and every credence $c \neq p$,

$$ \sum_{i = 1}^N p(\{ \omega_i \}) \cdot \mathfrak{A}(p, \omega_i) > \sum_{i = 1}^N p(\{ \omega_i \}) \cdot \mathfrak{A}(c, \omega_i) $$


(‘Weak propriety’ is the property you get when you swap out ‘$>$’ for ‘$\geq$’.)

The point of today’s post is that, contrary to what I once thought (and perhaps contrary to what some others thought as well—though this could be a confusion localized to my own brain), the Brier score is not strictly proper.

First, a bit of background: Given a (finite) set $\Omega = \{ \omega_1, \omega_2, \dots, \omega_N \}$ of possible states of the world, we can call any set of states in $\Omega$ a ‘proposition’. And I’ll call a set of propositions, $\mathscr{F}$, a ‘field’. Given a pair $(\Omega, \mathscr{F})$, with $\mathscr{F} \subseteq \wp(\Omega)$, a credence function, $c$, is just any function from $\mathscr{F}$ to the unit interval, $[0, 1]$.

A credence function $c$ is a probability function if it additionally satisfies the following two constraints:

  1. $c(\Omega) = 1$.
  2. For all $A, B \in \mathscr{F}$ such that $A \cap B = \emptyset$, $c(A \cup B) = c(A) + c(B)$.

To see that the Brier measure $\mathfrak{B}$ is not strictly proper, consider the set of states $\Omega = \{ \omega_1, \omega_2 \}$ and the field $\mathscr{F} = \{ \emptyset, \{ \omega_1 \}, \{ \omega_2 \}, \Omega \}$. Then, consider the probabilistic $p$ and the non-probabilistic $c$, both defined over the field $\mathscr{F}$. $$ \begin{array}{r | c c} A \in \mathscr{F} & p(A) & c(A) \\\hline \varnothing & 0 & 1 \\\
\{\omega_1\} & 1/2 & 1/2 \\\
\{\omega_2\} & 1/2 & 1/2 \\\
\Omega & 1 & 0
\end{array} $$
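
To double-check that $p$ satisfies the two constraints while $c$ violates the first of them, here is a quick sketch, with propositions represented as Python frozensets of states:

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
omega = frozenset({"w1", "w2"})

p = {frozenset(): 0, frozenset({"w1"}): half, frozenset({"w2"}): half, omega: 1}
c = {frozenset(): 1, frozenset({"w1"}): half, frozenset({"w2"}): half, omega: 0}

def is_probability(cred, omega):
    if cred[omega] != 1:                    # constraint 1
        return False
    for A, B in product(cred, repeat=2):    # constraint 2
        if A.isdisjoint(B) and A | B in cred and cred[A | B] != cred[A] + cred[B]:
            return False
    return True

print(is_probability(p, omega), is_probability(c, omega))   # True False
```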

The $p$-expected Brier accuracy of $p$ is

$$ \begin{aligned} \mathbb{E}_p \left[ \mathfrak{B}(p) \right] &= p(\{ \omega_1 \}) \cdot \mathfrak{B}(p, \omega_1) + p(\{ \omega_2 \}) \cdot \mathfrak{B}(p, \omega_2) \\\
&= 1/2 \cdot \left[ -(1-1/2)^2 - (1/2)^2 \right] + 1/2 \cdot \left[ -(1-1/2)^2 - (1/2)^2 \right] \\\
&= - 1/2 \end{aligned} $$

And the $p$-expected Brier accuracy of $c$ is likewise

$$ \begin{aligned} \mathbb{E}_p \left[ \mathfrak{B}(c) \right] &= p(\{ \omega_1 \}) \cdot \mathfrak{B}(c, \omega_1) + p(\{ \omega_2 \}) \cdot \mathfrak{B}(c, \omega_2) \\\
&= 1/2 \cdot \left[ -(1-1/2)^2 - (1/2)^2 \right] + 1/2 \cdot \left[ -(1-1/2)^2 - (1/2)^2 \right] \\\
&= - 1/2 \end{aligned}$$

So there is a probabilistic $p$ and a credence function $c \neq p$ such that $p$ expects $c$ to be just as Brier accurate as $p$ is itself. So the Brier measure of accuracy is not strictly proper.
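
Here is the same arithmetic done mechanically (a sketch; notice that $\mathfrak{B}$ only ever consults credences in the singletons, which is exactly why $c$’s deviant values on $\varnothing$ and $\Omega$ go unpunished):

```python
from fractions import Fraction

half = Fraction(1, 2)
states = ("w1", "w2")

# Credences in the singletons {w1} and {w2}; the Brier measure never
# looks at any other proposition, and p and c agree on these.
p_single = {"w1": half, "w2": half}
c_single = {"w1": half, "w2": half}

def brier(cred, i):
    # B(c, w_i) = -(1 - c({w_i}))^2 - sum over j != i of c({w_j})^2
    return (-(1 - cred[states[i]]) ** 2
            - sum(cred[s] ** 2 for j, s in enumerate(states) if j != i))

def p_expectation(cred):
    return sum(p_single[s] * brier(cred, i) for i, s in enumerate(states))

print(p_expectation(p_single), p_expectation(c_single))   # -1/2 -1/2
```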

Some have used the term ‘strict propriety’ differently than I defined it above. In the first place, Brier himself did not intend his measure to apply to credence functions, which are functions from arbitrary propositions to the unit interval, but rather forecasts, which he treated as assignments of real numbers from the unit interval to each individual state $\omega_i \in \Omega$. (Brier even required these numbers to sum to 1.) If you are in a context where you are evaluating, not credence functions, but forecasts, then you might want to define the notion of strict propriety like this:


Strict Propriety for Forecasts

A measure of accuracy $\mathfrak{A}$ is strictly proper for forecasts iff, for every probabilistic forecast $p$ and every forecast $f \neq p$, the $p$-expectation of $p$'s $\frak{A}$-accuracy is strictly greater than the $p$-expectation of $f$'s $\frak{A}$-accuracy. That is: for every probabilistic forecast $p$ and every forecast $f \neq p$,


$$\sum_{i = 1}^N p(\{\omega_i\}) \cdot \mathfrak{A}(p, \omega_i) > \sum_{i = 1}^N p(\{\omega_i\}) \cdot \mathfrak{A}(f, \omega_i) $$

And the Brier measure is strictly proper for forecasts. It’s just not strictly proper as epistemologists have been using that term, applied to arbitrary credence functions.

What is strictly proper, as a measure of the accuracy of credence functions, is this quadratic measure, which is also sometimes called the Brier measure (though it’s not the measure Brier himself explicitly endorsed):

$$\mathfrak{Q}(c, \omega) = - \sum_{A \in \mathscr{F}} ( \chi_A(\omega) - c(A) )^2 $$

(Here, ‘$\chi_A(\omega)$’ is the characteristic function for the proposition $A$, which maps a state $\omega$ to $1$ if $A$ is true in that state and $0$ otherwise.)
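
On the earlier two-state example, a quick sketch confirms that $\mathfrak{Q}$, unlike $\mathfrak{B}$, penalizes $c$’s deviant values on $\varnothing$ and $\Omega$:

```python
from fractions import Fraction

half = Fraction(1, 2)
states = ("w1", "w2")
field = [frozenset(), frozenset({"w1"}), frozenset({"w2"}), frozenset(states)]

p = {field[0]: 0, field[1]: half, field[2]: half, field[3]: 1}
c = {field[0]: 1, field[1]: half, field[2]: half, field[3]: 0}

def quadratic(cred, w):
    # Q(c, w) = -sum over A in F of (chi_A(w) - c(A))^2
    return -sum((int(w in A) - cred[A]) ** 2 for A in field)

def p_expectation(cred):
    return sum(p[frozenset({w})] * quadratic(cred, w) for w in states)

print(p_expectation(p), p_expectation(c))   # -1/2 -5/2: p strictly beats c
```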