Image credit: Pexels and David Bartus

# Intuition behind Cross-Validation

Cross-validation error is an estimate of the out-of-sample error. Cross-validation is a great tool for helping modelers select a model with low out-of-sample error. The objective of this note is to show you how to write simple code to carry out cross-validation in R. I will post similar code for SAS later.

K-fold cross-validation involves splitting the sample in K equal and independent subsamples (i.e., there is no overlap in the subsamples). Next you estimate the model on K-1 subsamples combined and test for its accuracy on the left out subsample. Calculate the out-of-sample error using the following formula for kth subsample1. $MSE_k = \frac{\sum_{i \in{C_k}}(y_i - \hat{y}_i)^2}{n_k}\qquad(1)$ Where $MSE_k$ is the mean squared error, $y$ is the dependent variable in the kth subsample, $\hat{y}$ is the predicted dependent variable in the kth subsample, and $n_k$ is the sample size of the kth subsample. We are summing the squared difference between $y_i$ and $\hat{y}_i$ for all the $i$ observations in the kth subsample and then dividing it by $n_k$.

Once we calculate $MSE$ for all the K subsamples, the cross-validation error $CV_{(K)}$ is given by the following formula: $CV_{(K)} = \sum_{k = 1}^K \frac{n_k}{n} \cdot MSE_k\qquad(2)$

Note that in this formula $\frac{n_k}{n}$ appears because although we created subsamples of equal sizes, there are likely to be situations where the subsamples were not of equal size. For example, if the sample size is 1111 and we decide to use 10-fold cross-validation, there will be 9 groups with 111 observations and 1 group with 112 observations as 1111 is not perfectly divisible by 10. Thus, equation 2 is a general form.

Let’s work on R code now. I am going to use titanic_train data for this example. The dependent variable is Fare and predictor variables are Sex, Pclass, and Embarked. For more information on the variables, check the package titanic on CRAN.

library(titanic)
library(plyr)

# 2 observations in Embarked are missing so I delete them right in the beginning
d1 <- titanic_train[titanic_train$Embarked != "", ] # Create a variable k for the number of subsamples k <- 5 Next we will split this dataset into 5 parts randomly. I am going to use random assignment so there is a good chance that the sample sizes will not be equal even when I had the number of observations perfectly divisible by 5. I am creating a new variable called “CV_Group1” in the data set d1. This variable will have the group numbers. set.seed(100) d1$CV_Group1 <- sample(1:k, nrow(d1), replace = TRUE)

Let’s get a frequency table for this variable.

table(d1$CV_Group1) ## ## 1 2 3 4 5 ## 153 198 162 189 187 Clearly the 5 subsamples have very different sample sizes. At this point we can still continue as our equation 2 will take care of different sample sizes. However, if you wanted a more uniform distribution, you may use the following code. I will leave out the explanation of logic for you to determine on your own. r <- rep(1:k, times = nrow(d1) %/% k) r <- if (nrow(d1) %% k != 0) { append(r, seq(1, nrow(d1) %% k)) } else { return(r) } set.seed(100) d1$CV_Group2 <- sample(r, nrow(d1), replace = F)

table(d1$CV_Group2) ## ## 1 2 3 4 5 ## 178 178 178 178 177 Now we have a nice uniform distribution across the 5 groups. Before we carry out cross-validation, let’s run the linear regression model to understand how well it performs in the sample. model1 <- lm(Fare ~ as.factor(Sex) + as.factor(Pclass) + as.factor(Embarked), d1) summary(model1) ## ## Call: ## lm(formula = Fare ~ as.factor(Sex) + as.factor(Pclass) + as.factor(Embarked), ## data = d1) ## ## Residuals: ## Min 1Q Median 3Q Max ## -74.12 -13.28 -1.28 2.20 426.07 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 98.541 3.733 26.400 < 2e-16 *** ## as.factor(Sex)male -12.278 2.820 -4.354 1.49e-05 *** ## as.factor(Pclass)2 -59.601 4.106 -14.517 < 2e-16 *** ## as.factor(Pclass)3 -65.197 3.426 -19.028 < 2e-16 *** ## as.factor(Embarked)Q -15.442 5.692 -2.713 0.006798 ** ## as.factor(Embarked)S -12.139 3.576 -3.394 0.000719 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 39.39 on 883 degrees of freedom ## Multiple R-squared: 0.3754, Adjusted R-squared: 0.3718 ## F-statistic: 106.1 on 5 and 883 DF, p-value: < 2.2e-16 The model has an adjusted R2 equaling 0.37, which is decent for this type of models. Note that we have used as.factor() for all the 3 independent variables. This means that R will internally create dummy variables for these variables. From the summary of the model, we find that all the variables in the model are statistically significant. That’s good news! Males paid on average £12 less than females. No surprise that 2nd class passengers paid on average £59.6 less and 3rd class passengers paid £65 less than 1st class passengers. Passengers boarding at port ‘C’ paid the highest fare. At this point you can tweak the model to include more variables that make sense. For example, perhaps adding Age will make sense 2. I would also use interactions or products of independent variables. For examples, females traveling by 1st class may pay higher or lower than the males traveling by 1st class. However, there may not be any significant difference in their fares for other two classes. Interaction between Sex and Pclass should capture such nuances. Once you are happy with the model, it’s the time to test it out of the sample. For this we will use cross-validation. MSE <- rep(NA,k) nk <- rep(NA,k) for (i in 1:k) { model <- lm(Fare ~ as.factor(Sex) + as.factor(Pclass) + as.factor(Embarked), d1[d1$CV_Group2 != i, ])
p <- predict.lm(model,newdata = d1[d1$CV_Group2 == i, ]) MSE[i] <- mean((d1[d1$CV_Group2 == i, "Fare"] - p)^2)
nk[i] <- nrow(d1[d1CV_Group2 == i, ]) } MSE_CV <- sum(MSE*(nk / nrow(d1))) print(MSE_CV) ## [1] 1558.201 The mean squared error from cross-validation is MSE_CV. This can be used to compare different models. For example, you can use the model with and without Age to see which one performs better out of the sample. ## Cross-validation for classification problems Above we saw cross-validation when the dependent variable is continuous. The general method remains the same for classification problem but the measure for error will be different. There is no MSE for classification because we are not predicting a continuous variable. Instead there will be classification error, which is given by the following formula: $CE_k = 1-\frac{\sum_{i \in{C_k}}CL_i}{n_k}\qquad(3)$ where $CE_k$ is classification error $k^{th}$ subsample, $CL_i$ is the classification for $i^{th}$ observation of $k^{th}$ subsample and $n_k$ is the sample size.$CL_i$ is an indicator variable, such that CL_i = \left \{ \begin{aligned} &0, && \text{if misclassification} \\ &1, && \text{if correct classification} \end{aligned} \right. Similar to equation 2 above, once we calculate $CE$ for all the K subsamples, the cross-validation error $CV_{(K)}$ is given by the following formula: $CV_{(K)} = \sum_{k = 1}^K \frac{n_k}{n} \cdot CE_k\qquad(2)$ Let’s use this formula for Titanic problem using a binary logistic regression model. The model that we will estimate is as follows. Note that I have kept the predictor variables the same as the linear model although the dependent variable here is Survived. m1 <- glm(Survived ~ as.factor(Sex) + as.factor(Pclass) + as.factor(Embarked), data = d1, family = binomial) p <- predict.glm(m1, newdata = d1, type = "response") As our data set d1 already consists of the cross-validation grouping, we don’t have to perform this task once again. Instead we will just use the following code. CE <- rep(NA,k) nk <- rep(NA,k) for (i in 1:k){ model2 <- glm(Survived ~ as.factor(Sex) + as.factor(Pclass) + as.factor(Embarked), d1[d1CV_Group2 != i,],
family = binomial)
p <- predict.glm(model2,
newdata = d1[d1$CV_Group2 == i, ], type = "response") # p has predicted probabilities which we need to convert to 1s and 0s p <- ifelse(p >= 0.5, 1, 0) # Note that I'm overwriting p. This is NOT a good programming practice. cm <- table(d1[d1$CV_Group2 == i, "Survived"], p) # Confusion Matrix
CE[i] <- 1 - sum(diag(cm)) / sum(cm)
nk[i] <- nrow(d1[d1\$CV_Group2 == i, ])
}

CE_CV <- sum(CE*(nk / nrow(d1)))
print(paste0("CV Misclassification Error = ", CE_CV))
## [1] "CV Misclassification Error = 0.226096737907761"

That’s all. This means that our model has misclassified 22.6% of the observations out of the sample. Is this good or bad? This will depend on how many survivors we had in our data and also on the other measures such as precision and recall.

1. This example is for continuous dependent variable. I will modify this post later to do cross-validation on a categorical dependent variable

2. Recall that there were many missing values in this variable so make sure to impute them