Image credit: Pexels and Sabrina Gelbart

DA 6813 Homework Solution

This note is pretty old. I have modified this note using dplyr package but still plenty of the code is still base R. The original note is available here: http://rpubs.com/malshe/224660. This file is just a small part of the original file.

The original homework questions are available here: http://rpubs.com/malshe/224662

library(dplyr)
library(here)

Get the data in

red <- read.csv(here::here("static", "data", "winequality-red.csv"),
                stringsAsFactors = F)

red$wine <- "red"

white <- read.csv(here::here("static", "data", "winequality-white.csv"),
                stringsAsFactors = F)

white$wine <- "white"

wine <- rbind(red,white)
wine$wine <- as.factor(wine$wine)

Get the summary of the data

summary(wine)
##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0       
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0       
##  Median :0.04700   Median : 29.00      Median :118.0       
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7       
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0       
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9949   Median :3.210   Median :0.5100   Median :10.30  
##  Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality         wine     
##  Min.   :3.000   red  :1599  
##  1st Qu.:5.000   white:4898  
##  Median :6.000               
##  Mean   :5.818               
##  3rd Qu.:6.000               
##  Max.   :9.000

Our dependent variable is quality. As we will be using it as a categorical variable in 3 of the 4 models, let’s look at its frequency distribution.

table(wine$quality)
## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5

Clearly, the categories at the extreme have very few observations. This will lead to problems in correctly categorizing these values. In order to overcome this problem, I will create two new variables. For support vector machines (SVM) and multinomial logistic model (MNL), I will create a new variable labeled quality.c which will be a factor variable with groups 3, 4, 8, and 9 combined in another group. We can label it anything we want as the labeling is meaningless for these methods. I will label this new group 3489 thus preserving the knowledge that this group came from 4 separate groups.

For ordinal regression, we have a little bit more information about the ordering of the groups. I will create a new variable quality.o. In this variable, I will combine 3, 4, and 5 and label it 5 to indicate this is 5 and lower. Similarly, I will combine 7, 8, and 9 and label it 7 to indicate that this group is 7 and above. Thus, we will effectively have only 3 groups.

Clearly, this will make model comparison a little bit tough but we have to give each model the best chance to perform even at this lower level of analysis.

wine <- wine %>%
          mutate(quality.c = quality) %>%
          mutate(quality.o = quality) %>%
          mutate(quality.c = replace(quality.c,
                                     quality %in% c(3,4,8,9), 3489)) %>%
          mutate(quality.o = replace(quality.o,
                                     quality %in% c(3,4), 5)) %>%
          mutate(quality.o = replace(quality.o,
                                     quality %in% c(8,9), 7)) %>%
          mutate(quality.c = factor(quality.c)) %>%
          mutate(quality.o = ordered(quality.o))

str(wine$quality.c)
##  Factor w/ 4 levels "5","6","7","3489": 1 1 1 2 1 1 1 3 3 1 ...
str(wine$quality.o)
##  Ord.factor w/ 3 levels "5"<"6"<"7": 1 1 1 2 1 1 1 3 3 1 ...

Now that we have new dependent variables, let’s take a look at the independent variables. As we are doing a predictive analysis (we have no plan to do inference), let’s understand the distribution and correlations of the variables and explore the need for transformation.

For this we will first get the descriptive statistics and correlations for all the numeric variables.

cormat <- round(cor(as.matrix(wine[,-c(13,14,15)])),2)
cormat[upper.tri(cormat)] <- ""
as.data.frame(cormat)
cormat <- round(cor(as.matrix(wine[,-c(13,14,15)])),2)
cormat[upper.tri(cormat)] <- ""
knitr::kable(as.data.frame(cormat),
             options = list(pageLength = 12,
              columnDefs = list(list(className = 'dt-center',
                                     targets = c(1:12)))))
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality
fixed_acidity 1
volatile_acidity 0.22 1
citric_acid 0.32 -0.38 1
residual_sugar -0.11 -0.2 0.14 1
chlorides 0.3 0.38 0.04 -0.13 1
free_sulfur_dioxide -0.28 -0.35 0.13 0.4 -0.2 1
total_sulfur_dioxide -0.33 -0.41 0.2 0.5 -0.28 0.72 1
density 0.46 0.27 0.1 0.55 0.36 0.03 0.03 1
pH -0.25 0.26 -0.33 -0.27 0.04 -0.15 -0.24 0.01 1
sulphates 0.3 0.23 0.06 -0.19 0.4 -0.19 -0.28 0.26 0.19 1
alcohol -0.1 -0.04 -0.01 -0.36 -0.26 -0.18 -0.27 -0.69 0.12 0 1
quality -0.08 -0.27 0.09 -0.04 -0.2 0.06 -0.04 -0.31 0.02 0.04 0.44 1
library(ggcorrplot)
## Loading required package: ggplot2
#ggcorrplot(, hc.order = TRUE, type = "lower", lab = F)
ggcorrplot(round(cor(as.matrix(wine[,-c(13,14,15)])),2), 
           p.mat = cor_pmat(as.matrix(wine[,-c(13,14,15)])),
           hc.order = TRUE, type = "lower",
   outline.col = "white",
   ggtheme = ggplot2::theme_bw,
   colors = c("#6D9EC1", "white", "#E46726"))

In the above heat map, the crosses indicate non significant correlations. From the correlations it seems that most variables have their own unique information set. However, it appears that quality is strongly related to only a few variables. This is not great news. At this point one can think of transformations to increase the correlations between the variables. However, I am not going to do it as this will make this exercise broader than I expected.

Let’s get the descriptive staistics

library(moments)

desc <- as.data.frame(cbind(Mean = sapply(wine[,-c(13,14,15)], mean),
                      Median = sapply(wine[,-c(13,14,15)], median),
                      Std.Dev = sapply(wine[,-c(13,14,15)], sd),
                      CV = sapply(wine[,-c(13,14,15)], sd) / sapply(wine[,-c(13,14,15)], mean),
                      Skewness = sapply(wine[,-c(13,14,15)], skewness),
                      Kurtosis = sapply(wine[,-c(13,14,15)], kurtosis)))

round(desc,2)
##                        Mean Median Std.Dev   CV Skewness Kurtosis
## fixed_acidity          7.22   7.00    1.30 0.18     1.72     8.06
## volatile_acidity       0.34   0.29    0.16 0.48     1.49     5.82
## citric_acid            0.32   0.31    0.15 0.46     0.47     5.39
## residual_sugar         5.44   3.00    4.76 0.87     1.44     7.35
## chlorides              0.06   0.05    0.04 0.63     5.40    53.86
## free_sulfur_dioxide   30.53  29.00   17.75 0.58     1.22    10.90
## total_sulfur_dioxide 115.74 118.00   56.52 0.49     0.00     2.63
## density                0.99   0.99    0.00 0.00     0.50     9.60
## pH                     3.22   3.21    0.16 0.05     0.39     3.37
## sulphates              0.53   0.51    0.15 0.28     1.80    11.65
## alcohol               10.49  10.30    1.19 0.11     0.57     2.47
## quality                5.82   6.00    0.87 0.15     0.19     3.23

The most interesting column for me is the CV (coefficient of variation). This is the ratio of standard deviation to the mean. We have certain observations where CV is very low (e.g., 0 or 0.05). This means that the standard deviation is extremely small compared to the mean. Clearly we need some scaling here to remove the effect of the mean. One way to do that is to mean center all the variables so that we have zero mean all across. It retains the low standard deviation, however. To overcome this issue, we can divide all the variables by their standard deviations. This way we will normalize our data such that all the variables will have mean = 0 and standard deviation = 1.

Will that help with reducing skewness and kurtosis? Think about it for a moment before you read on.

I am going to scale the numeric variables.

wine2 <- wine
wine2[,c(1:12)] <- scale(wine[,c(1:12)])

desc2 <- as.data.frame(cbind(Mean = sapply(wine2[,c(1:12)], mean),
                      Median = sapply(wine2[,c(1:12)], median),
                      Std.Dev = sapply(wine2[,c(1:12)], sd),
                      Skewness = sapply(wine2[,c(1:12)], skewness),
                      Kurtosis = sapply(wine2[,c(1:12)], kurtosis)))

round(desc2,2)
##                      Mean Median Std.Dev Skewness Kurtosis
## fixed_acidity           0  -0.17       1     1.72     8.06
## volatile_acidity        0  -0.30       1     1.49     5.82
## citric_acid             0  -0.06       1     0.47     5.39
## residual_sugar          0  -0.51       1     1.44     7.35
## chlorides               0  -0.26       1     5.40    53.86
## free_sulfur_dioxide     0  -0.09       1     1.22    10.90
## total_sulfur_dioxide    0   0.04       1     0.00     2.63
## density                 0   0.06       1     0.50     9.60
## pH                      0  -0.05       1     0.39     3.37
## sulphates               0  -0.14       1     1.80    11.65
## alcohol                 0  -0.16       1     0.57     2.47
## quality                 0   0.21       1     0.19     3.23

Scaling doesn’t affect skewness or kurtosis. In order to alter these two moments, we need to use non linear transformation such as logarithmic or square root transformations. I’m going to do it through trial and error.

Normal distribution has skewness = 0 and kurtosis = 3. total_sulfur_dioxide, pH, alcohol, and quality seem to have this shape (a good idea is to plot distributions). I am concerned about fixed_acidity, chlorides, free_sulfur_dioxide, density, and sulphates due to high kurtosis (and skewness in some cases). Let’s take their log transform first and then scale these variables.

wine2[,c(1,5,6,8,10)] <- scale(log(wine[,c(1,5,6,8,10)]))

moments::skewness(wine2[,c(1,5,6,8,10)])
##       fixed_acidity           chlorides free_sulfur_dioxide 
##           0.8889319           0.8762698          -0.8340045 
##             density           sulphates 
##           0.4672599           0.4048986
moments::kurtosis(wine2[,c(1,5,6,8,10)])
##       fixed_acidity           chlorides free_sulfur_dioxide 
##            4.896783            5.305355            3.429675 
##             density           sulphates 
##            9.008338            3.701296

Except for density the remaining 4 variables benefited from log transformation. After plotting the distribution for density it appears that this might be because of a couple of extreme values. Although log transformation should have gotten rid of them, it seems it didn’t work out. So I replaced the two extreme values (there were 3 observations) with the next highest observation and then took log. As it turns out, the transformation paid off.

wine2$density <- ifelse(wine$density > 1.00369,
                        1.00369,
                        wine$density) # replace the values > 1.00369 by 1.00369
wine2$density <- scale(log(wine2$density))

moments::skewness(wine2$density)
## [1] -0.02258351
moments::kurtosis(wine2$density)
## [1] 2.254121

Now we have a dataset wine2 which has all the transformed variables. I am going to use it for the rest of the analysis.

___________ Terminated ___________

Related

comments powered by Disqus