Online Appendix

This post accompanies the article “Rejoinder to ‘Endogeneity bias in marketing research: Problem, causes and remedies’” in Industrial Marketing Management (IMM). For completeness, we restate some sections of the main text in this document. Throughout, we include R code to run the 2SLS estimations, create the graphs, and generate the datasets.

Model setting

Before we start working on simulated data, let’s understand how much bias we expect if we use each of OLS, ZKHL, and 2SLS.

We have a data generating process (DGP) that involves a dependent variable y, two independent variables x1 and x2, and a latent variable omVar. The latent variable is neither observed nor measured; therefore, it is omitted from all the estimations.

The DGP for y is as follows:

\[ \begin{align*} y &= \beta_0 + \beta_1 x1+\beta_2x2 + \beta_3omVar +\eta \qquad(1a)\\ \textrm{where,}\quad \beta_0 &= 2, \\ \beta_1 &= 3,\\ \beta_2 &= 1,\\ \beta_3 &= 3 \quad \textrm{and} \quad \eta \sim N(\mu = 0,\sigma =10) \end{align*} \]

However, there is more to this DGP. We also put various restrictions on x1, x2, and omVar:

  1. \(x2 \sim N(\mu = 0,\sigma =1)\)
  2. \(x1 = z1 + z2 + omVar\)
    • \(z1 \sim N(\mu = 0,\sigma =1)\)
    • \(z2 \sim N(\mu = 0,\sigma =1)\)
    • \(omVar \sim N(\mu = 0,\sigma =1)\)
  3. \(cov(z1,z2) = cov(z1,omVar) = cov(z2,omVar) = 0\)
  4. \(cov(z1,x2) = cov(z2,x2) = cov(x2,omVar) = cov(x1,x2) = 0\)

Because we don’t observe omVar, the model that we are estimating is as follows:

\[ y = \hat\beta_0 + \hat\beta_1 x1+\hat\beta_2x2+\hat\epsilon \qquad(2a) \]

In this equation, omVar is absorbed into the error term \(\hat\epsilon\). If omVar were uncorrelated with x1 and x2, we would have no concerns about endogeneity. However, by design x1 and omVar are correlated because we generated x1 as follows:

\[ x1 = z1 + z2 + omVar \qquad(3a) \]

Thus, x1 and \(\hat\epsilon\) should be correlated. However, when we estimate Equation (2a), OLS, by design, produces an estimate of \(\hat\epsilon\) that is uncorrelated with x1. So where did the correlation go? The answer lies in the covariances between x1 and omVar and between x2 and omVar. By construction, the covariance between x2 and omVar is zero. However, the covariance between x1 and omVar is non-zero. Here is why:

\[ \begin{aligned} x1 &= z1 + z2 + omVar \qquad(3a) \\ cov(x1, omVar) &= cov(z1, omVar) + cov(z2, omVar) + cov(omVar, omVar) \\ &= 0 + 0 + var(omVar) \\ &= 1 \end{aligned} \]

The above derivation relies on three important aspects of our data-generating process.

  1. By design z1 and omVar are uncorrelated
  2. By design z2 and omVar are uncorrelated
  3. The variance of omVar is 1 because it follows a normal distribution with mean = 0 and standard deviation = 1.

Because of the non-zero covariance between x1 and omVar, OLS assigns the variation in y due to omVar to x1. In simple terms, x1 gets credit for the work omVar did. This is reflected in the estimated coefficient of x1. Thus, the estimate of \(\beta_1\) in Equation (2a), which we call \(\hat\beta_1\), will be biased. The formula for \(E(\hat\beta_1)\) is as follows:

\[ E(\hat\beta_1) = \beta_1 + \beta_3\cdot\frac{cov(x1,omVar)}{var(x1)} \qquad(4a) \]

Above we showed that \(cov(x1,omVar)\) equals 1. Let’s now get the value of \(var(x1)\), starting with \((3a)\):

\[ \begin{aligned} x1 &= z1 + z2 + omVar \qquad(3a) \\ var(x1) &= var(z1) + var(z2) + var(omVar) \\ &= 1 + 1 + 1 \\ &= 3 \end{aligned} \]

Substitute \(var(x1)\) into Equation (4a). Also note from Equation (1a) that \(\beta_1\) is 3 and \(\beta_3\) is also 3.

\[ \begin{aligned} E(\hat\beta_1) &= \beta_1 + \beta_3\cdot\frac{cov(x1,omVar)}{var(x1)} \qquad(4a) \\ & = 3 + 3\cdot\frac{1}{3} \\ &= 4 \end{aligned} \]

Thus, OLS will wrongly estimate \(\beta_1\) and provide an inflated estimate.
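
A quick numerical check of this derivation is below. This is a standalone sketch, separate from the simulations later in this document; the large sample size is used only so that the sample moments come close to their population values.

# Check cov(x1, omVar) = 1, var(x1) = 3, and the implied OLS estimate of 4
set.seed(1)
n <- 1e6
omVar <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n)
x1 <- z1 + z2 + omVar
cov(x1, omVar)                   # close to 1
var(x1)                          # close to 3
3 + 3 * cov(x1, omVar) / var(x1) # close to 4, the expected OLS estimate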

The bias due to ZKHL methodology

ZKHL recommend that to correct the bias in OLS estimates, we carry out the following two-step process:

  1. Regress x1 on x2 and the instrument z1, and obtain the residuals.
  2. Regress y on x2 and the residuals from Step 1 instead of x1.

We argue that this will not correct the bias and will, in fact, exacerbate the endogeneity concern. The simple reason is that the Step 1 residuals retain the endogenous part of x1. In Step 2, we then use that purely endogenous part of x1 in place of x1!

We present a formal derivation of the increased bias in this subsection.

First, we estimate the following regression to get the residuals: \[ x1 = \gamma_0 + \gamma_1z1 + \gamma_2x2 + \zeta \]

Recall that \(cov(z1,x2) = cov(x1,x2) = 0\). Given these covariances and Equation (3a), we can conclude that \(\zeta = z2+omVar\). Thus, in the first stage, we effectively remove only z1 from x1.

Next, we regress y on \(\zeta\) and x2. For \(\hat\beta_1\), we modify Equation (4a):

\[ E(\hat\beta_1) = \beta_1 + \beta_3\cdot\frac{cov(\zeta,omVar)}{var(\zeta)} \qquad(5a) \]

Given all the relationships in the DGP, we can show that \(cov(\zeta,omVar) = var(omVar) = 1\) and \(var(\zeta) = var(z2) + var(omVar) = 2\). Substituting into (5a) yields:

\[ \begin{aligned} E(\hat\beta_1) &= \beta_1 + \beta_3\cdot\frac{cov(\zeta,omVar)}{var(\zeta)} \qquad(5a) \\ & = 3 + 3\cdot\frac{1}{2} \\ &= 4.5 \end{aligned} \]

Thus, rather than reducing the bias, ZKHL add to it. If we had used both instruments z1 and z2 above, the ZKHL method would have raised the bias even further, leading to an expected estimate of \(\beta_1\) equal to 6, as shown below.
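
To see where the value of 6 comes from, note that if both z1 and z2 are partialled out in the first stage, the residual reduces to omVar itself, so the bias term in Equation (5a) uses \(cov(omVar,omVar)\) and \(var(omVar)\):

\[ \begin{aligned} E(\hat\beta_1) &= \beta_1 + \beta_3\cdot\frac{cov(omVar,omVar)}{var(omVar)} \\ & = 3 + 3\cdot\frac{1}{1} \\ &= 6 \end{aligned} \]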

Simulations

We create 1000 simulated datasets. We will use the first of these datasets for most of the analysis here and in the main text; however, we will use all 1000 simulations for the visualization.

Each dataset has 1000 observations, which is our sample size. We follow these steps to create the data:

  1. Create an omitted variable omVar, which will be used only in the data generating process but not in the estimation. Omission of this variable leads to endogeneity.
  2. Create two instruments z1 and z2 and one exogenous independent variable x2.
  3. Create the endogenous independent variable x1 as a function of z1, z2, and omVar. We put equal weights on each of them.
  4. Finally, create the dependent variable y as a function of x1, x2, omVar, and random noise.

To reiterate, in actual estimation, none of the models will have access to omVar. It is used only in the data generating process.

Reminder: How is x1 endogenous?

Note that in the data generating process, omVar affects both x1 and y. However, in the estimation we will not have omVar. Thus, effectively omVar is absorbed into the error term, which is correlated with x1.

sampleSize <- 1000 # number of observations in each dataset
myList <- list() # List that will hold the 1000 simulations
set.seed(12345)
for (i in 1L:1000L) {
  omVar <- rnorm(sampleSize, mean = 0, sd = 1) # Omitted variable
  z1 <- rnorm(sampleSize, mean = 0, sd = 1)    # Instrument 1
  z2 <- rnorm(sampleSize, mean = 0, sd = 1)    # Instrument 2
  x2 <- rnorm(sampleSize, mean = 0, sd = 1)    # Exogenous independent variable
  x1 <- z1 + z2 + omVar      # Endogenous independent variable
  # Dependent variable is generated as follows
  y <- 2 + 3*x1 + 1*x2 + 3*omVar + 10*rnorm(sampleSize, mean = 0, sd = 1) 

  myList[[i]] <- data.frame(y = y,
                            x1 = x1,
                            x2 = x2,
                            z1 = z1,
                            z2 = z2,
                            omVar = omVar,
                            sample = i)
  rm(omVar, z1, z2, x1, x2, y)
}

Estimation using OLS, ZKHL, and 2SLS

For estimation, we will use only the first of the 1000 datasets we created above.

myData <- myList[[1]]

OLS Estimation

OLS is going to produce biased and inconsistent estimates because it ignores the endogeneity in the x1 variable and treats it as exogenous.

summary(lm(y ~ x1 + x2, data = myData ))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = myData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.116  -7.103   0.418   6.977  34.033 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.0868     0.3351   6.227 6.98e-10 ***
## x1            3.7352     0.1945  19.202  < 2e-16 ***
## x2            1.2846     0.3521   3.649 0.000277 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.6 on 997 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.2735 
## F-statistic: 189.1 on 2 and 997 DF,  p-value: < 2.2e-16

As we see in the results above, the estimated coefficient of x1 is much higher than 3, the true parameter. We can test whether this estimate is significantly different from 3 (rather than from 0) to ascertain that the bias is non-zero. A t-test comparing the coefficient of x1 to 3 returns a t-statistic of 3.78 with a p-value below 0.001. This strongly supports that the bias is non-zero.
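
A minimal sketch of how this test can be computed in R is below; the object names olsFit, tStat, and pVal are ours.

# Test the OLS estimate of the x1 coefficient against its true value of 3
olsFit <- lm(y ~ x1 + x2, data = myData)
tStat <- unname((coef(olsFit)["x1"] - 3) / sqrt(vcov(olsFit)["x1", "x1"]))
pVal <- 2 * pt(abs(tStat), df = olsFit$df.residual, lower.tail = FALSE)
c(t = tStat, p = pVal) # t is about 3.78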

Model with ZKHL method

Now let’s see whether the method proposed by ZKHL reduces the bias and gets us an estimated coefficient closer to 3. According to them, we first regress the endogenous variable x1 on the instrument and the exogenous variable (z1 and x2, respectively). Next, we regress y on x2 and the residuals from the first stage instead of the endogenous variable x1.

If this method corrects the bias, we should see the coefficient on the first-stage residuals of x1 come out close to 3.

myData$x1_1 <- lm(x1~z1 + x2, data = myData)$residuals
summary(lm(y ~ x1_1 + x2, data = myData))
## 
## Call:
## lm(formula = y ~ x1_1 + x2, data = myData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.372  -7.525   0.097   7.036  37.858 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.0538     0.3438   5.974 3.22e-09 ***
## x1_1          4.3416     0.2505  17.335  < 2e-16 ***
## x2            1.0807     0.3611   2.993  0.00283 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.87 on 997 degrees of freedom
## Multiple R-squared:  0.2369, Adjusted R-squared:  0.2353 
## F-statistic: 154.7 on 2 and 997 DF,  p-value: < 2.2e-16

In contrast to the expectation that the coefficient on the first-stage residuals of x1 will be closer to 3, we find that it has actually moved farther away from 3! The t-test comparing the estimate of \(\beta_1\) to 3 returns a t-statistic of 5.357 with a p-value below 0.001. Let’s correct this by using the fitted values of x1 from the first-stage estimation rather than the residuals.

2SLS

myData$x1_2 <- lm(x1~z1 + x2, data = myData)$fitted.values
summary(lm(y ~ x1_2 + x2, data = myData))
## 
## Call:
## lm(formula = y ~ x1_2 + x2, data = myData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.083  -8.018  -0.482   7.393  37.011 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.0774     0.3821   5.437 6.83e-08 ***
## x1_2          2.6805     0.3671   7.302 5.81e-13 ***
## x2            1.2270     0.4018   3.054  0.00232 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.08 on 997 degrees of freedom
## Multiple R-squared:  0.05727,    Adjusted R-squared:  0.05538 
## F-statistic: 30.28 on 2 and 997 DF,  p-value: 1.707e-13

Indeed, the coefficient on the fitted values of x1 is close to 3. The t-test comparing it with 3 returns a non-significant t-statistic of -0.87 (p-value = 0.384). Thus, it seems that 2SLS has corrected the bias.

2SLS using ivreg function

In R, we can perform 2SLS using the ivreg function from the AER package. This function provides more diagnostics, so we prefer it over doing 2SLS manually. Manual 2SLS also results in incorrect standard errors, which we did not correct in the calculation above. First, we lost 3 degrees of freedom to the first-stage regression. Second, any time we include an independent variable that was estimated in a previous regression, it introduces additional sampling variation that is not accounted for in the straightforward OLS standard-error calculation (Wooldridge, 2010) 1. ivreg, like any other 2SLS routine in modern econometric software, automatically provides correct standard errors. Although the difference is tiny in this case due to the large sample size, with smaller samples it can be considerably larger.

library(AER) # provides the ivreg function
ivM1 <- ivreg(y ~ x1 + x2 | x2 + z1, data = myData)
summary(ivM1, diagnostics = TRUE)
## 
## Call:
## ivreg(formula = y ~ x1 + x2 | x2 + z1, data = myData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -39.9030  -7.1996   0.3698   6.9697  33.9642 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.0774     0.3400   6.110 1.43e-09 ***
## x1            2.6805     0.3267   8.206 7.02e-16 ***
## x2            1.2270     0.3575   3.432 0.000624 ***
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 997    573.24 < 2e-16 ***
## Wu-Hausman         1 996     17.18 3.7e-05 ***
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 997 degrees of freedom
## Multiple R-Squared: 0.2536,  Adjusted R-squared: 0.2521 
## Wald test: 38.25 on 2 and 997 DF,  p-value: < 2.2e-16

What if we have more than one instrument for x1?

In this example, we add the second instrument, z2, to the model.

ivM2 <- ivreg(y ~ x1 + x2 | x2 + z1 + z2, data = myData)
summary(ivM2, diagnostics = TRUE)
## 
## Call:
## ivreg(formula = y ~ x1 + x2 | x2 + z1 + z2, data = myData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -39.9440  -7.1656   0.3717   6.9970  33.9605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.0770     0.3405   6.099 1.52e-09 ***
## x1            2.6254     0.2424  10.830  < 2e-16 ***
## x2            1.2240     0.3579   3.420 0.000651 ***
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   2 996   988.150  < 2e-16 ***
## Wu-Hausman         1 996    68.988 3.21e-16 ***
## Sargan             1  NA     0.063    0.802    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.77 on 997 degrees of freedom
## Multiple R-Squared: 0.2513,  Adjusted R-squared: 0.2498 
## Wald test: 63.21 on 2 and 997 DF,  p-value: < 2.2e-16

Now the diagnostics also provide us with Sargan’s overidentification test. The null hypothesis of the test is that the overidentifying restrictions are valid and the instruments are exogenous. We find that the test fails to reject the null hypothesis; thus, z1 and z2 are acceptable instruments. Note that when the errors are spherical (neither heteroskedastic nor autocorrelated), Sargan’s test is equivalent to Hansen’s test of overidentifying restrictions. If you suspect that heteroskedasticity is a concern, it is preferable to use robust standard errors and report Hansen’s J statistic instead of Sargan’s test statistic. In Stata, the user-written command ivreg2 reports all the relevant tests post-estimation.
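
As a cross-check, the Sargan statistic can also be computed by hand: regress the 2SLS residuals on all exogenous variables and instruments; under the null hypothesis, n times the R-squared of that regression is chi-squared distributed with degrees of freedom equal to the number of overidentifying restrictions (here, two instruments minus one endogenous regressor = 1). A minimal sketch is below; u2, sarganAux, and sarganStat are our names.

# Manual Sargan test: n * R^2 from regressing the 2SLS residuals
# on all exogenous variables and instruments
u2 <- residuals(ivM2)
sarganAux <- summary(lm(u2 ~ x2 + z1 + z2, data = myData))
sarganStat <- nrow(myData) * sarganAux$r.squared
c(statistic = sarganStat,
  p.value = pchisq(sarganStat, df = 1, lower.tail = FALSE))

The result should be close to the Sargan line in the diagnostics table above.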

Control function

Finally, we modify ZKHL’s procedure and include both x1 and its first-stage residuals in the second stage. That is, we regress y on x1, x2, and the first-stage residuals of x1. This control function approach fixes the bias in the estimates.

summary(lm(y ~ x1 + x1_1 + x2, data = myData))
## 
## Call:
## lm(formula = y ~ x1 + x1_1 + x2, data = myData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.552  -7.083   0.448   6.630  33.311 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.0774     0.3324   6.249  6.1e-10 ***
## x1            2.6805     0.3194   8.393  < 2e-16 ***
## x1_1          1.6611     0.4008   4.144  3.7e-05 ***
## x2            1.2270     0.3495   3.510 0.000467 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.51 on 996 degrees of freedom
## Multiple R-squared:  0.2873, Adjusted R-squared:  0.2851 
## F-statistic: 133.8 on 3 and 996 DF,  p-value: < 2.2e-16

Note that the coefficient of x1_1, the residual from the first-stage regression, is statistically significant. This indicates that x1 is indeed endogenous and that including the control function accounts for this endogeneity. A more formal test for the presence of endogeneity, which helps in deciding whether we need to use 2SLS in the first place, is the Durbin-Wu-Hausman (DWH) test. The regression form of this test is simply an F-test on the significance of the first-stage residuals. ivreg reports the result of the DWH test above (the Wu-Hausman diagnostic); it rejects the null hypothesis of no endogeneity, strongly suggesting the presence of at least one endogenous regressor.
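
As a sketch of this connection, note that with a single endogenous regressor the squared t-statistic on the first-stage residuals in the control-function regression matches, up to rounding, the Wu-Hausman statistic that ivreg reported earlier. Here cfFit is our name for the control-function regression estimated above.

# Regression-based DWH test: squared t-statistic on the first-stage residuals
cfFit <- lm(y ~ x1 + x1_1 + x2, data = myData)
summary(cfFit)$coefficients["x1_1", "t value"]^2 # roughly 17.2, cf. Wu-Hausman above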

Graphs

It is possible that the results above are an artifact of the particular random sample we generated. After all, we used only the first of the 1000 simulated samples. To address this concern, we run OLS, ZKHL, and 2SLS on each of the 1000 samples and then plot the estimated coefficients of x1 from these models.

OLS

m1 <- structure(list(`(Intercept)` = double(),
                 x1 = double(),
                 x2 = double()),
                class = 'data.frame')
# Fit the OLS model on each sample and store its coefficient estimates
for (i in 1L:1000L) {
  m1 <- rbind(m1, as.data.frame(t(lm(y ~ x1 + x2, data = myList[[i]])$coefficients)))
}

ZKHL

m2 <- structure(list(`(Intercept)` = double(),
                 x1_1 = double(),
                 x2 = double()),
                class = 'data.frame')
# Apply the ZKHL two-step procedure on each sample and store the estimates
for (i in 1L:1000L) {
  myList[[i]]$x1_1 <- lm(x1~z1 + x2, data = myList[[i]])$residuals
  m2 <- rbind(m2, as.data.frame(t(lm(y ~ x1_1 + x2, data = myList[[i]])$coefficients)))
}

2SLS

m3 <- structure(list(`(Intercept)` = double(),
                 x1_2 = double(),
                 x2 = double()),
                class = 'data.frame')
# Apply manual 2SLS on each sample and store the estimates
for (i in 1L:1000L) {
  myList[[i]]$x1_2 <- lm(x1~z1 + x2, data = myList[[i]])$fitted.values
  m3 <- rbind(m3, as.data.frame(t(lm(y ~ x1_2 + x2, data = myList[[i]])$coefficients)))
}

Label each set of estimates with the model that produced it.

m1$Model <- 'OLS'
m2$Model <- 'ZKHL'
m3$Model <- '2SLS'

We use ggplot2 for plotting distributions of estimated coefficients. We also plot vertical lines corresponding to the means of the distributions. For this, the following code creates a separate dataset of means for each method.

library(dplyr) # provides %>%, mutate, group_by, and summarise

mMeans <- rbind(m1, setNames(m2, names(m1)), setNames(m3, names(m1))) %>%
  mutate(Model2 = factor(Model, levels = c('OLS','ZKHL','2SLS'))) %>%
  group_by(Model2) %>%
  summarise(x1 = mean(x1))

Now we can make the plot.
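
A minimal sketch of the plotting code, using ggplot2 (and assuming dplyr is already loaded for %>% and mutate, as above), is as follows.

library(ggplot2)

rbind(m1, setNames(m2, names(m1)), setNames(m3, names(m1))) %>%
  mutate(Model2 = factor(Model, levels = c('OLS', 'ZKHL', '2SLS'))) %>%
  ggplot(aes(x = x1, fill = Model2)) +
  geom_density(alpha = 0.4) +                # distributions of the x1 estimates
  geom_vline(data = mMeans, aes(xintercept = x1, colour = Model2),
             linetype = 'dashed') +          # mean of each distribution
  geom_vline(xintercept = 3) +               # true coefficient
  labs(x = 'Estimated coefficient of x1', fill = 'Model', colour = 'Model')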

Recall that the true coefficient of x1 is 3. In this simulation, only 2SLS gives us estimates centered around the true coefficient. As we showed above, the expected OLS estimate is 4 and the expected ZKHL estimate is 4.5. In the plot, we indeed get these biased estimates for OLS and ZKHL.


  1. Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). Cambridge, MA: MIT Press.
