This post accompanies the article “Rejoinder to ‘Endogeneity bias in marketing research: Problem, causes and remedies’” in Industrial Marketing Management (IMM). We restate some sections of the main text for completeness in this document. Throughout we include R code to run the 2SLS estimations, create graphs, and generate the datasets.

### Model setting

Before we start working on simulated data, let’s understand how much bias we expect if we use each of OLS, ZKHL, and 2SLS.

We have a data generating process (DGP), which involves a dependent variable `y`, two independent variables `x1` and `x2`, and a latent variable `omVar`. The latent variable is neither observed nor measured, and therefore it will be omitted from all the estimations.

The DGP for `y` is as follows:

\[ \begin{align*} y &= \beta_0 + \beta_1 x1 + \beta_2 x2 + \beta_3\, omVar + \eta \qquad(1)\\ \textrm{where}\quad \beta_0 &= 2, \\ \beta_1 &= 3,\\ \beta_2 &= 1,\\ \beta_3 &= 3, \quad \textrm{and} \quad \eta \sim N(\mu = 0,\sigma =10) \end{align*} \]

However, there is more to this DGP. We also put various restrictions on `x1`, `x2`, and `omVar`:

- \(x2 \sim N(\mu = 0,\sigma =1)\)
- \(x1 = z1 + z2 + omVar\)
- \(z1 \sim N(\mu = 0,\sigma =1)\)
- \(z2 \sim N(\mu = 0,\sigma =1)\)
- \(omVar \sim N(\mu = 0,\sigma =1)\)
- \(cov(z1,z2) = cov(z1,omVar) = cov(z2,omVar) = 0\)
- \(cov(z1,x2) = cov(z2,x2) = cov(x2,omVar) = cov(x1,x2) = 0\)

Because we don’t observe `omVar`, the model that we are estimating is as follows:

\[ y = \hat\beta_0 + \hat\beta_1 x1 + \hat\beta_2 x2 + \hat\epsilon \qquad(2) \]

In this equation, `omVar` is absorbed into the error term \(\hat\epsilon\). If `omVar` were uncorrelated with `x1` and `x2`, we would have no concerns about endogeneity. However, by design, `x1` and `omVar` are correlated because we generated `x1` as follows:

\[ x1 = z1 + z2 + omVar \qquad(3) \]

Thus, `x1` and \(\hat\epsilon\) should be correlated. However, when we estimate Equation (2), OLS by design produces an estimate of \(\hat\epsilon\) that is **uncorrelated** with `x1`. So where did the correlation go? The answer lies in the covariances between `x1` and `omVar` and between `x2` and `omVar`. By construction, the covariance between `x2` and `omVar` is zero. However, the covariance between `x1` and `omVar` is non-zero. Here is why:

\[ \begin{aligned} x1 &= z1 + z2 + omVar \qquad(3) \\ cov(x1, omVar) &= cov(z1, omVar) + cov(z2, omVar) + cov(omVar, omVar) \\ &= 0 + 0 + var(omVar) \\ &= 1 \end{aligned} \]

The above derivation relies on three important aspects of our data-generating process:

- By design, `z1` and `omVar` are uncorrelated.
- By design, `z2` and `omVar` are uncorrelated.
- The variance of `omVar` is 1 because it follows a normal distribution with mean 0 and standard deviation 1.

Because of the non-zero covariance between `x1` and `omVar`, OLS assigns the variation in `y` due to `omVar` to `x1`. In simple words, `x1` gets credit for the work `omVar` did. This is reflected in the estimated coefficient of `x1`. Thus, the estimate of \(\beta_1\) in Equation (2), call it \(\hat\beta_1\), will be biased. The formula for \(E(\hat\beta_1)\) is as follows:

\[ E(\hat\beta_1) = \beta_1 + \beta_3\cdot\frac{cov(x1,omVar)}{var(x1)} \qquad(4) \]

Above we showed that \(cov(x1,omVar)\) equals 1. Let’s get the value of \(var(x1)\). Starting with Equation (3):

\[ \begin{aligned} x1 &= z1 + z2 + omVar \qquad(3) \\ var(x1) &= var(z1) + var(z2) + var(omVar) \\ &= 1 + 1 + 1 \\ &= 3 \end{aligned} \]

Substitute \(var(x1)\) in Equation (4). Also note that from Equation (1), \(\beta_1\) is 3 and \(\beta_3\) is also 3.

\[ \begin{aligned} E(\hat\beta_1) &= \beta_1 + \beta_3\cdot\frac{cov(x1,omVar)}{var(x1)} \qquad(4) \\ & = 3 + 3\cdot\frac{1}{3} \\ &= 4 \end{aligned} \]

Thus, OLS will wrongly estimate \(\beta_1\) and provide an inflated estimate.
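These theoretical quantities are easy to verify numerically. Below is a minimal sketch that draws one fresh, large sample from the DGP (independent of the simulated datasets created later in this document) and checks the derived values of \(cov(x1, omVar)\), \(var(x1)\), and the biased OLS coefficient:

```r
set.seed(1)
n <- 1e5  # large sample so empirical moments sit close to their theoretical values
omVar <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n); x2 <- rnorm(n)
x1 <- z1 + z2 + omVar
y  <- 2 + 3*x1 + 1*x2 + 3*omVar + 10*rnorm(n)

cov(x1, omVar)               # close to 1, as derived above
var(x1)                      # close to 3
coef(lm(y ~ x1 + x2))["x1"]  # close to 4, not the true value of 3
```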

### The bias due to ZKHL methodology

ZKHL recommend that to correct the bias in OLS estimates, we carry out the following two-step process:

- Regress `x1` on `x2` and the instrument `z1`, and obtain the residuals.
- Regress `y` on `x2` and the residuals from Step 1 *instead of* `x1`.

We argue that this will not correct the bias and will, in fact, exacerbate the endogeneity concern. The simple reason is that Step 1 retains the endogenous part of `x1` (the residuals), and Step 2 then uses that purely **endogenous** part of `x1` in place of `x1`! We present a formal derivation of the increased bias in this subsection.

First, we estimate the following regression to get the residuals:

\[ x1 = \gamma_0 + \gamma_1 z1 + \gamma_2 x2 + \zeta \]

Recall that \(cov(z1,x2) = cov(x1,x2) = 0\). Given these covariances and Equation (3), we can conclude that \(\zeta = z2 + omVar\). Thus, in the first stage we effectively remove `z1` from `x1`.

Next, we regress `y` on \(\zeta\) and `x2`. For \(\hat\beta_1\), we modify Equation (4):

\[ E(\hat\beta_1) = \beta_1 + \beta_3\cdot\frac{cov(\zeta,omVar)}{var(\zeta)} \qquad(5) \]

Given all the relationships in the DGP, we can show that \(cov(\zeta,omVar) = var(omVar) = 1\) and \(var(\zeta) = var(z2) + var(omVar) = 2\). Substituting into Equation (5) yields:

\[ \begin{aligned} E(\hat\beta_1) &= \beta_1 + \beta_3\cdot\frac{cov(\zeta,omVar)}{var(\zeta)} \qquad(5) \\ & = 3 + 3\cdot\frac{1}{2} \\ &= 4.5 \end{aligned} \]

Thus, rather than reducing the bias, ZKHL add to it. If we had used both instruments, `z1` and `z2`, above, the ZKHL method would have raised the bias even further, leading to an estimated \(\beta_1\) of 6.
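The same numerical check applies to the ZKHL residual regressions. A sketch on a fresh, large sample from the DGP, showing the predicted biased estimates of 4.5 (with `z1` only) and 6 (with both instruments):

```r
set.seed(2)
n <- 1e5
omVar <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n); x2 <- rnorm(n)
x1 <- z1 + z2 + omVar
y  <- 2 + 3*x1 + 1*x2 + 3*omVar + 10*rnorm(n)

zeta1 <- lm(x1 ~ z1 + x2)$residuals       # removes z1 only, so zeta = z2 + omVar
zeta2 <- lm(x1 ~ z1 + z2 + x2)$residuals  # removes both instruments, so zeta = omVar

coef(lm(y ~ zeta1 + x2))["zeta1"]  # close to 4.5
coef(lm(y ~ zeta2 + x2))["zeta2"]  # close to 6
```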

### Simulations

We create 1000 simulated datasets. We will use the first of these datasets for most of the analysis here and in the main text, and the full set of 1000 simulations for visualization.

Each dataset has 1000 observations, which is the sample size. We follow these steps for data creation:

- Create an omitted variable `omVar`, which is used only in the data generating process but not in the estimation. Omission of this variable leads to endogeneity.
- Create two instruments, `z1` and `z2`, and one exogenous independent variable, `x2`.
- Create the endogenous independent variable `x1` as a function of `z1`, `z2`, and `omVar`, with equal weights on each.
- Finally, create the dependent variable `y` as a function of `x1`, `x2`, `omVar`, and random noise.

To reiterate, in the actual estimation, none of the models will have access to `omVar`. It is used only in the data generating process.

#### Reminder: How is `x1` endogenous?

Note that in the data generating process, `omVar` affects both `x1` and `y`. However, in the estimation we will not have `omVar`. Thus, `omVar` is effectively absorbed into the error term, which is therefore correlated with `x1`.

```
sampleSize <- 1000 # number of observations in each dataset
myList <- list()   # list that will hold the 1000 simulations
set.seed(12345)
for (i in 1L:1000L) {
  omVar <- rnorm(sampleSize, mean = 0, sd = 1) # Omitted variable
  z1 <- rnorm(sampleSize, mean = 0, sd = 1)    # Instrument 1
  z2 <- rnorm(sampleSize, mean = 0, sd = 1)    # Instrument 2
  x2 <- rnorm(sampleSize, mean = 0, sd = 1)    # Exogenous independent variable
  x1 <- z1 + z2 + omVar                        # Endogenous independent variable
  # Dependent variable is generated as follows
  y <- 2 + 3*x1 + 1*x2 + 3*omVar + 10*rnorm(sampleSize, mean = 0, sd = 1)
  myList[[i]] <- data.frame(y = y,
                            x1 = x1,
                            x2 = x2,
                            z1 = z1,
                            z2 = z2,
                            omVar = omVar,
                            sample = i)
  rm(omVar, z1, z2, x1, x2, y)
}
```

### Estimation using OLS, ZKHL, and 2SLS

For estimation, we will use only the first of the 1000 datasets we created above.

`myData <- myList[[1]]`

### OLS Estimation

OLS is going to produce biased and inconsistent estimates because it ignores the endogeneity in `x1` and treats the variable as exogenous.

`summary(lm(y ~ x1 + x2, data = myData ))`

```
##
## Call:
## lm(formula = y ~ x1 + x2, data = myData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.116 -7.103 0.418 6.977 34.033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0868 0.3351 6.227 6.98e-10 ***
## x1 3.7352 0.1945 19.202 < 2e-16 ***
## x2 1.2846 0.3521 3.649 0.000277 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.6 on 997 degrees of freedom
## Multiple R-squared: 0.275, Adjusted R-squared: 0.2735
## F-statistic: 189.1 on 2 and 997 DF, p-value: < 2.2e-16
```

As we see in the results above, the estimated coefficient of `x1` is much higher than 3, which is the true parameter. We can test whether this estimate is significantly different from 3, rather than from 0, to ascertain that the bias is non-zero. A t-test comparing the coefficient of `x1` to 3 returns a t-statistic of 3.78 with a p-value close to 0. This strongly supports that the bias is non-zero.
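This t-statistic can be computed directly from the regression output as (estimate − 3) / standard error. A self-contained sketch that re-creates the first simulated dataset (same seed and same generation order as the simulation code above):

```r
set.seed(12345)
n <- 1000
omVar <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n); x2 <- rnorm(n)
x1 <- z1 + z2 + omVar
y  <- 2 + 3*x1 + 1*x2 + 3*omVar + 10*rnorm(n)

fit <- summary(lm(y ~ x1 + x2))
est <- fit$coefficients["x1", "Estimate"]
se  <- fit$coefficients["x1", "Std. Error"]
tstat <- (est - 3) / se                     # H0: beta1 = 3, not the default beta1 = 0
pval  <- 2 * pt(-abs(tstat), df = fit$df[2])
c(t = tstat, p = pval)
```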

### Model with ZKHL method

Now let’s see whether the method proposed by ZKHL reduces the bias and gets us an estimated coefficient closer to 3. According to them, we first regress the endogenous variable `x1` on the instrument and the exogenous variable (`z1` and `x2`, respectively). Next, we regress `y` on `x2` and the residuals from the first stage instead of the endogenous variable `x1`.

If this method corrects the bias, we should see a coefficient on the first-stage residuals of `x1` closer to 3.

```
myData$x1_1 <- lm(x1 ~ z1 + x2, data = myData)$residuals
summary(lm(y ~ x1_1 + x2, data = myData))
```

```
##
## Call:
## lm(formula = y ~ x1_1 + x2, data = myData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.372 -7.525 0.097 7.036 37.858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0538 0.3438 5.974 3.22e-09 ***
## x1_1 4.3416 0.2505 17.335 < 2e-16 ***
## x2 1.0807 0.3611 2.993 0.00283 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.87 on 997 degrees of freedom
## Multiple R-squared: 0.2369, Adjusted R-squared: 0.2353
## F-statistic: 154.7 on 2 and 997 DF, p-value: < 2.2e-16
```

In contrast to the expectation that the coefficient on the first-stage residuals of `x1` will be closer to 3, we find that it has actually moved farther away from 3! The t-test comparing the estimate of \(\beta_1\) to 3 returns a t-statistic of 5.357 and a p-value close to 0. Let’s correct this by using the fitted values of `x1` from the first-stage estimation rather than the residuals.

### 2SLS

```
myData$x1_2 <- lm(x1 ~ z1 + x2, data = myData)$fitted.values
summary(lm(y ~ x1_2 + x2, data = myData))
```

```
##
## Call:
## lm(formula = y ~ x1_2 + x2, data = myData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.083 -8.018 -0.482 7.393 37.011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0774 0.3821 5.437 6.83e-08 ***
## x1_2 2.6805 0.3671 7.302 5.81e-13 ***
## x2 1.2270 0.4018 3.054 0.00232 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.08 on 997 degrees of freedom
## Multiple R-squared: 0.05727, Adjusted R-squared: 0.05538
## F-statistic: 30.28 on 2 and 997 DF, p-value: 1.707e-13
```

Indeed, the coefficient on the fitted values of `x1` is close to 3. The t-test comparing it with 3 returns a non-significant t-statistic of -0.87 (p-value = 0.384). Thus, it seems that 2SLS has corrected the bias.

### 2SLS using the `ivreg` function

In R, we can perform 2SLS using the `ivreg` function from the `AER` package. This function provides more diagnostics, so we prefer it over doing 2SLS manually. Moreover, manual 2SLS results in standard errors that are too low, and we did not correct for this in the manual calculation above. First, we lost 3 degrees of freedom due to the first-stage regression. Second, anytime we include an independent variable that was estimated in a previous regression, it introduces additional sampling variation that the straightforward OLS standard error calculation does not account for (Wooldridge, 2010). `ivreg`, like any other 2SLS routine in modern econometric software, automatically provides correct standard errors. Although the difference is tiny in this case due to the large sample size, with a smaller sample size it can be considerably larger.

```
library(AER) # provides ivreg()
ivM1 <- ivreg(y ~ x1 + x2 | x2 + z1, data = myData)
summary(ivM1, diagnostics = TRUE)
```

```
##
## Call:
## ivreg(formula = y ~ x1 + x2 | x2 + z1, data = myData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.9030 -7.1996 0.3698 6.9697 33.9642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0774 0.3400 6.110 1.43e-09 ***
## x1 2.6805 0.3267 8.206 7.02e-16 ***
## x2 1.2270 0.3575 3.432 0.000624 ***
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 1 997 573.24 < 2e-16 ***
## Wu-Hausman 1 996 17.18 3.7e-05 ***
## Sargan 0 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.75 on 997 degrees of freedom
## Multiple R-Squared: 0.2536, Adjusted R-squared: 0.2521
## Wald test: 38.25 on 2 and 997 DF, p-value: < 2.2e-16
```

### What if we have more than one instrument for `x1`?

In this example, we add the second instrument, `z2`, to the model.

```
ivM2 <- ivreg(y ~ x1 + x2 | x2 + z1 + z2, data = myData)
summary(ivM2, diagnostics = TRUE)
```

```
##
## Call:
## ivreg(formula = y ~ x1 + x2 | x2 + z1 + z2, data = myData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.9440 -7.1656 0.3717 6.9970 33.9605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0770 0.3405 6.099 1.52e-09 ***
## x1 2.6254 0.2424 10.830 < 2e-16 ***
## x2 1.2240 0.3579 3.420 0.000651 ***
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 996 988.150 < 2e-16 ***
## Wu-Hausman 1 996 68.988 3.21e-16 ***
## Sargan 1 NA 0.063 0.802
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.77 on 997 degrees of freedom
## Multiple R-Squared: 0.2513, Adjusted R-squared: 0.2498
## Wald test: 63.21 on 2 and 997 DF, p-value: < 2.2e-16
```

Now the diagnostics also include Sargan’s overidentification test. The null hypothesis of the test is that the overidentifying restrictions are valid and the instruments are exogenous. The test fails to reject the null hypothesis; thus, `z1` and `z2` are acceptable instruments. Note that when the errors are spherical (neither heteroskedastic nor autocorrelated), Sargan’s test is equivalent to Hansen’s test of overidentifying restrictions. If you suspect that heteroskedasticity is a concern, it is preferable to use robust standard errors and report Hansen’s J statistic instead of Sargan’s test statistic. In Stata, the user-written command `ivreg2` reports all the relevant tests after estimation.

### Control function

Finally, we modify ZKHL’s procedure and include both `x1` and its first-stage residuals in the second stage. Thus, we regress `y` on `x1`, `x2`, and the first-stage residuals of `x1`. This fixes the bias in the estimates.

`summary(lm(y ~ x1 + x1_1 + x2, data = myData))`

```
##
## Call:
## lm(formula = y ~ x1 + x1_1 + x2, data = myData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.552 -7.083 0.448 6.630 33.311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0774 0.3324 6.249 6.1e-10 ***
## x1 2.6805 0.3194 8.393 < 2e-16 ***
## x1_1 1.6611 0.4008 4.144 3.7e-05 ***
## x2 1.2270 0.3495 3.510 0.000467 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.51 on 996 degrees of freedom
## Multiple R-squared: 0.2873, Adjusted R-squared: 0.2851
## F-statistic: 133.8 on 3 and 996 DF, p-value: < 2.2e-16
```

Note that the coefficient on `x1_1`, which is the residual from the first-stage regression, is statistically significant. This indicates that we have accounted for endogeneity by including the control function. A more formal test for the presence of endogeneity, which helps in deciding whether we need to use 2SLS in the first place, is the Durbin-Wu-Hausman (DWH) test. The regression form of this test is simply an F-test on the significance of the first-stage residuals; with a single endogenous regressor, that F-statistic is just the square of the t-statistic on `x1_1` (here, \(4.144^2 \approx 17.2\), matching the Wu-Hausman statistic reported by `ivreg` above). The test strongly rejects the null hypothesis of no endogeneity, suggesting the presence of at least one endogenous regressor.
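To make the regression form of the DWH test concrete, here is a self-contained sketch on a fresh simulated sample (new seed, same DGP), verifying that the F-statistic on the first-stage residuals equals the square of their t-statistic when there is a single endogenous regressor:

```r
set.seed(99)
n <- 1000
omVar <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n); x2 <- rnorm(n)
x1 <- z1 + z2 + omVar
y  <- 2 + 3*x1 + 1*x2 + 3*omVar + 10*rnorm(n)

res <- lm(x1 ~ z1 + x2)$residuals
cf         <- lm(y ~ x1 + res + x2)  # control-function regression
restricted <- lm(y ~ x1 + x2)        # same model without the residuals

Fstat <- anova(restricted, cf)$F[2]  # regression form of the DWH test
tstat <- summary(cf)$coefficients["res", "t value"]
all.equal(Fstat, tstat^2)            # the F-statistic is the squared t-statistic
```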

## Graphs

It’s possible that our results above are a function of the random sample that we generated. After all, we used only the first of the 1000 simulated samples we created originally. To address this concern, we run OLS, ZKHL, and 2SLS on each of the 1000 samples. We then plot the estimates of the `x1` coefficient from these models.

### OLS

```
m1 <- structure(list(`(Intercept)` = double(),
                     x1 = double(),
                     x2 = double()),
                class = 'data.frame')
for (i in 1L:1000L) {
  m1 <- rbind(m1, as.data.frame(t(lm(y ~ x1 + x2, data = myList[[i]])$coefficients)))
}
```

### ZKHL

```
m2 <- structure(list(`(Intercept)` = double(),
                     x1_1 = double(),
                     x2 = double()),
                class = 'data.frame')
for (i in 1L:1000L) {
  myList[[i]]$x1_1 <- lm(x1 ~ z1 + x2, data = myList[[i]])$residuals
  m2 <- rbind(m2, as.data.frame(t(lm(y ~ x1_1 + x2, data = myList[[i]])$coefficients)))
}
```

### 2SLS

```
m3 <- structure(list(`(Intercept)` = double(),
                     x1_2 = double(),
                     x2 = double()),
                class = 'data.frame')
for (i in 1L:1000L) {
  myList[[i]]$x1_2 <- lm(x1 ~ z1 + x2, data = myList[[i]])$fitted.values
  m3 <- rbind(m3, as.data.frame(t(lm(y ~ x1_2 + x2, data = myList[[i]])$coefficients)))
}
```

Next, we label each set of estimates with the model that produced it.

```
m1$Model <- 'OLS'
m2$Model <- 'ZKHL'
m3$Model <- '2SLS'
```

We use `ggplot2` for plotting the distributions of estimated coefficients. We also plot vertical lines corresponding to the means of the distributions. For this, the following code creates a separate dataset of means for each method.

```
library(dplyr)
mMeans <- rbind(m1, setNames(m2, names(m1)), setNames(m3, names(m1))) %>%
  mutate(Model2 = factor(Model, levels = c('OLS', 'ZKHL', '2SLS'))) %>%
  group_by(Model2) %>%
  summarise(x1 = mean(x1))
```

Now we can make the plot.
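The rendered plot itself is not reproduced in this document. Here is a self-contained sketch of `ggplot2` code that produces a comparable figure; it re-runs a smaller batch of 200 simulations for brevity, and the aesthetic choices (density overlays, transparency, dashed mean lines) are assumptions rather than the original plotting code:

```r
library(ggplot2)

# Re-run a smaller batch of simulations and collect the estimated
# x1 coefficient from each of the three methods.
set.seed(12345)
n <- 1000
est <- do.call(rbind, lapply(1:200, function(i) {
  omVar <- rnorm(n); z1 <- rnorm(n); z2 <- rnorm(n); x2 <- rnorm(n)
  x1 <- z1 + z2 + omVar
  y  <- 2 + 3*x1 + 1*x2 + 3*omVar + 10*rnorm(n)
  res <- lm(x1 ~ z1 + x2)$residuals       # ZKHL uses the first-stage residuals
  fit <- lm(x1 ~ z1 + x2)$fitted.values   # 2SLS uses the first-stage fitted values
  data.frame(Model = c('OLS', 'ZKHL', '2SLS'),
             x1 = c(coef(lm(y ~ x1 + x2))['x1'],
                    coef(lm(y ~ res + x2))['res'],
                    coef(lm(y ~ fit + x2))['fit']))
}))
est$Model <- factor(est$Model, levels = c('OLS', 'ZKHL', '2SLS'))
means <- aggregate(x1 ~ Model, data = est, FUN = mean)

p <- ggplot(est, aes(x = x1, fill = Model)) +
  geom_density(alpha = 0.4) +
  geom_vline(data = means, aes(xintercept = x1), linetype = 'dashed') +
  geom_vline(xintercept = 3) +  # true coefficient
  labs(x = 'Estimated coefficient of x1')
# print(p) to render the figure
```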

Recall that the true coefficient of `x1` is 3. In this simulation, only 2SLS gives us estimates centered around the true coefficient. As we showed above, mathematically, the biased coefficient of OLS will be 4 and that of ZKHL will be 4.5. The plot indeed shows these biased estimates for OLS and ZKHL.

Wooldridge, J. M. (2010). *Econometric Analysis of Cross Section and Panel Data* (2nd ed.). Cambridge, MA: MIT Press.