This note is pretty old. I have modified this note using `dplyr`

package but still plenty of the code is still base R. The original note is available here: http://rpubs.com/malshe/224660. This file is just a small part of the original file.

The original homework questions are available here: http://rpubs.com/malshe/224662

```
library(dplyr)
library(here)
```

## Get the data in

```
red <- read.csv(here::here("static", "data", "winequality-red.csv"),
stringsAsFactors = F)
red$wine <- "red"
white <- read.csv(here::here("static", "data", "winequality-white.csv"),
stringsAsFactors = F)
white$wine <- "white"
wine <- rbind(red,white)
wine$wine <- as.factor(wine$wine)
```

Get the summary of the data

`summary(wine)`

```
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide
## Min. :0.00900 Min. : 1.00 Min. : 6.0
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0
## Median :0.04700 Median : 29.00 Median :118.0
## Mean :0.05603 Mean : 30.53 Mean :115.7
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0
## Max. :0.61100 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50
## Median :0.9949 Median :3.210 Median :0.5100 Median :10.30
## Mean :0.9947 Mean :3.219 Mean :0.5313 Mean :10.49
## 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30
## Max. :1.0390 Max. :4.010 Max. :2.0000 Max. :14.90
## quality wine
## Min. :3.000 red :1599
## 1st Qu.:5.000 white:4898
## Median :6.000
## Mean :5.818
## 3rd Qu.:6.000
## Max. :9.000
```

Our dependent variable is `quality`

. As we will be using it as a categorical variable in 3 of the 4 models, let’s look at its frequency distribution.

`table(wine$quality)`

```
##
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
```

Clearly, the categories at the extreme have very few observations. This will lead to problems in correctly categorizing these values. In order to overcome this problem, I will create two new variables.
For support vector machines (SVM) and multinomial logistic model (MNL), I will create a new variable labeled `quality.c`

which will be a factor variable with groups 3, 4, 8, and 9 combined in another group. We can label it anything we want as the labeling is meaningless for these methods. I will label this new group `3489`

thus preserving the knowledge that this group came from 4 separate groups.

For ordinal regression, we have a little bit more information about the ordering of the groups. I will create a new variable `quality.o`

. In this variable, I will combine 3, 4, and 5 and label it 5 to indicate this is 5 and lower. Similarly, I will combine 7, 8, and 9 and label it 7 to indicate that this group is 7 and above. Thus, we will effectively have only 3 groups.

Clearly, this will make model comparison a little bit tough but we have to give each model the best chance to perform even at this lower level of analysis.

```
wine <- wine %>%
mutate(quality.c = quality) %>%
mutate(quality.o = quality) %>%
mutate(quality.c = replace(quality.c,
quality %in% c(3,4,8,9), 3489)) %>%
mutate(quality.o = replace(quality.o,
quality %in% c(3,4), 5)) %>%
mutate(quality.o = replace(quality.o,
quality %in% c(8,9), 7)) %>%
mutate(quality.c = factor(quality.c)) %>%
mutate(quality.o = ordered(quality.o))
str(wine$quality.c)
```

`## Factor w/ 4 levels "5","6","7","3489": 1 1 1 2 1 1 1 3 3 1 ...`

`str(wine$quality.o)`

`## Ord.factor w/ 3 levels "5"<"6"<"7": 1 1 1 2 1 1 1 3 3 1 ...`

Now that we have new dependent variables, let’s take a look at the independent variables. As we are doing a predictive analysis (we have no plan to do inference), let’s understand the distribution and correlations of the variables and explore the need for transformation.

For this we will first get the descriptive statistics and correlations for all the numeric variables.

```
cormat <- round(cor(as.matrix(wine[,-c(13,14,15)])),2)
cormat[upper.tri(cormat)] <- ""
as.data.frame(cormat)
```

```
cormat <- round(cor(as.matrix(wine[,-c(13,14,15)])),2)
cormat[upper.tri(cormat)] <- ""
knitr::kable(as.data.frame(cormat),
options = list(pageLength = 12,
columnDefs = list(list(className = 'dt-center',
targets = c(1:12)))))
```

fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

fixed_acidity | 1 | |||||||||||

volatile_acidity | 0.22 | 1 | ||||||||||

citric_acid | 0.32 | -0.38 | 1 | |||||||||

residual_sugar | -0.11 | -0.2 | 0.14 | 1 | ||||||||

chlorides | 0.3 | 0.38 | 0.04 | -0.13 | 1 | |||||||

free_sulfur_dioxide | -0.28 | -0.35 | 0.13 | 0.4 | -0.2 | 1 | ||||||

total_sulfur_dioxide | -0.33 | -0.41 | 0.2 | 0.5 | -0.28 | 0.72 | 1 | |||||

density | 0.46 | 0.27 | 0.1 | 0.55 | 0.36 | 0.03 | 0.03 | 1 | ||||

pH | -0.25 | 0.26 | -0.33 | -0.27 | 0.04 | -0.15 | -0.24 | 0.01 | 1 | |||

sulphates | 0.3 | 0.23 | 0.06 | -0.19 | 0.4 | -0.19 | -0.28 | 0.26 | 0.19 | 1 | ||

alcohol | -0.1 | -0.04 | -0.01 | -0.36 | -0.26 | -0.18 | -0.27 | -0.69 | 0.12 | 0 | 1 | |

quality | -0.08 | -0.27 | 0.09 | -0.04 | -0.2 | 0.06 | -0.04 | -0.31 | 0.02 | 0.04 | 0.44 | 1 |

`library(ggcorrplot)`

`## Loading required package: ggplot2`

```
#ggcorrplot(, hc.order = TRUE, type = "lower", lab = F)
ggcorrplot(round(cor(as.matrix(wine[,-c(13,14,15)])),2),
p.mat = cor_pmat(as.matrix(wine[,-c(13,14,15)])),
hc.order = TRUE, type = "lower",
outline.col = "white",
ggtheme = ggplot2::theme_bw,
colors = c("#6D9EC1", "white", "#E46726"))
```

In the above heat map, the crosses indicate non significant correlations. From the correlations it seems that most variables have their own unique information set. However, it appears that `quality`

is strongly related to only a few variables. This is not great news.
At this point one can think of transformations to increase the correlations between the variables. However, I am not going to do it as this will make this exercise broader than I expected.

Let’s get the descriptive staistics

```
library(moments)
desc <- as.data.frame(cbind(Mean = sapply(wine[,-c(13,14,15)], mean),
Median = sapply(wine[,-c(13,14,15)], median),
Std.Dev = sapply(wine[,-c(13,14,15)], sd),
CV = sapply(wine[,-c(13,14,15)], sd) / sapply(wine[,-c(13,14,15)], mean),
Skewness = sapply(wine[,-c(13,14,15)], skewness),
Kurtosis = sapply(wine[,-c(13,14,15)], kurtosis)))
round(desc,2)
```

```
## Mean Median Std.Dev CV Skewness Kurtosis
## fixed_acidity 7.22 7.00 1.30 0.18 1.72 8.06
## volatile_acidity 0.34 0.29 0.16 0.48 1.49 5.82
## citric_acid 0.32 0.31 0.15 0.46 0.47 5.39
## residual_sugar 5.44 3.00 4.76 0.87 1.44 7.35
## chlorides 0.06 0.05 0.04 0.63 5.40 53.86
## free_sulfur_dioxide 30.53 29.00 17.75 0.58 1.22 10.90
## total_sulfur_dioxide 115.74 118.00 56.52 0.49 0.00 2.63
## density 0.99 0.99 0.00 0.00 0.50 9.60
## pH 3.22 3.21 0.16 0.05 0.39 3.37
## sulphates 0.53 0.51 0.15 0.28 1.80 11.65
## alcohol 10.49 10.30 1.19 0.11 0.57 2.47
## quality 5.82 6.00 0.87 0.15 0.19 3.23
```

The most interesting column for me is the CV (coefficient of variation). This is the ratio of standard deviation to the mean. We have certain observations where CV is very low (e.g., 0 or 0.05). This means that the standard deviation is extremely small compared to the mean. Clearly we need some scaling here to remove the effect of the mean. One way to do that is to mean center all the variables so that we have zero mean all across. It retains the low standard deviation, however. To overcome this issue, we can divide all the variables by their standard deviations. This way we will normalize our data such that all the variables will have mean = 0 and standard deviation = 1.

Will that help with reducing skewness and kurtosis? Think about it for a moment before you read on.

I am going to scale the numeric variables.

```
wine2 <- wine
wine2[,c(1:12)] <- scale(wine[,c(1:12)])
desc2 <- as.data.frame(cbind(Mean = sapply(wine2[,c(1:12)], mean),
Median = sapply(wine2[,c(1:12)], median),
Std.Dev = sapply(wine2[,c(1:12)], sd),
Skewness = sapply(wine2[,c(1:12)], skewness),
Kurtosis = sapply(wine2[,c(1:12)], kurtosis)))
round(desc2,2)
```

```
## Mean Median Std.Dev Skewness Kurtosis
## fixed_acidity 0 -0.17 1 1.72 8.06
## volatile_acidity 0 -0.30 1 1.49 5.82
## citric_acid 0 -0.06 1 0.47 5.39
## residual_sugar 0 -0.51 1 1.44 7.35
## chlorides 0 -0.26 1 5.40 53.86
## free_sulfur_dioxide 0 -0.09 1 1.22 10.90
## total_sulfur_dioxide 0 0.04 1 0.00 2.63
## density 0 0.06 1 0.50 9.60
## pH 0 -0.05 1 0.39 3.37
## sulphates 0 -0.14 1 1.80 11.65
## alcohol 0 -0.16 1 0.57 2.47
## quality 0 0.21 1 0.19 3.23
```

Scaling doesn’t affect skewness or kurtosis. In order to alter these two moments, we need to use non linear transformation such as logarithmic or square root transformations. I’m going to do it through trial and error.

Normal distribution has skewness = 0 and kurtosis = 3. `total_sulfur_dioxide`

, `pH`

, `alcohol`

, and `quality`

seem to have this shape (a good idea is to plot distributions). I am concerned about `fixed_acidity`

, `chlorides`

, `free_sulfur_dioxide`

, `density`

, and `sulphates`

due to high kurtosis (and skewness in some cases). Let’s take their log transform first and then scale these variables.

```
wine2[,c(1,5,6,8,10)] <- scale(log(wine[,c(1,5,6,8,10)]))
moments::skewness(wine2[,c(1,5,6,8,10)])
```

```
## fixed_acidity chlorides free_sulfur_dioxide
## 0.8889319 0.8762698 -0.8340045
## density sulphates
## 0.4672599 0.4048986
```

`moments::kurtosis(wine2[,c(1,5,6,8,10)])`

```
## fixed_acidity chlorides free_sulfur_dioxide
## 4.896783 5.305355 3.429675
## density sulphates
## 9.008338 3.701296
```

Except for `density`

the remaining 4 variables benefited from log transformation. After plotting the distribution for `density`

it appears that this might be because of a couple of extreme values. Although log transformation should have gotten rid of them, it seems it didn’t work out. So I replaced the two extreme values (there were 3 observations) with the next highest observation and then took log. As it turns out, the transformation paid off.

```
wine2$density <- ifelse(wine$density > 1.00369,
1.00369,
wine$density) # replace the values > 1.00369 by 1.00369
wine2$density <- scale(log(wine2$density))
moments::skewness(wine2$density)
```

`## [1] -0.02258351`

`moments::kurtosis(wine2$density)`

`## [1] 2.254121`

Now we have a dataset `wine2`

which has all the transformed variables. I am going to use it for the rest of the analysis.

___________ Terminated ___________