Image credit: Pexels and Pixabay

Some ggplot2 Features

In this post, I cover a few things that I didn’t touch upon in class. We will use the mpg data from the ggplot2 package. To learn more about the data and its variables, run the following command in your R console: ?ggplot2::mpg
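The snippets in this post assume that ggplot2 is already attached. The post never shows that step, so the following is only a minimal setup sketch, not the original setup code.

```r
# Attach ggplot2, which also makes the mpg data set available
library(ggplot2)

# Quick sanity check on the size of the data (rows, columns)
dim(mpg)
```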

Get structure of the mpg data

head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
## 2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
## 3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
## 4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
## 5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
## 6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

More on univariate plots

table(mpg$drv)
## 
##   4   f   r 
## 103 106  25
ggplot(mpg, aes(drv, hwy)) +
  geom_point()

Compare the frequency table to the scatterplot and notice that the table reports many more observations than there are visible points in the plot. Why?

This is due to overlapping points: cars with the same drv and hwy values are drawn on top of each other. A potential solution is to change the transparency of the points, so that overlapping points look darker than non-overlapping points.

ggplot(mpg, aes(drv, hwy)) +
  geom_point(alpha = 0.1)
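One way to check the overplotting explanation is to count how many distinct (drv, hwy) positions the data actually occupies. This is a small sketch; the object names n_rows and n_positions are made up for illustration.

```r
library(ggplot2)

# Number of rows, i.e. the number of points ggplot2 draws
n_rows <- nrow(mpg)

# Number of distinct positions those points occupy on the plot
n_positions <- nrow(unique(mpg[, c("drv", "hwy")]))

c(rows = n_rows, distinct_positions = n_positions)
```

If the second number is smaller than the first, some points must sit exactly on top of others.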

Another solution is to use a jitter plot. ggplot2 adds a small amount of random noise to the position of each point so that overlapping points separate and you can see them more clearly. To make your graph reproducible, set the seed for the random number generator.

set.seed(478)

ggplot(mpg, aes(drv, hwy)) +
  geom_jitter()
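By default, geom_jitter() moves points both horizontally and vertically, which slightly changes the displayed hwy values. The sketch below is one possible variation, not the approach used above: it jitters only along the x axis and passes the seed to position_jitter() so that the noise is reproducible for this layer. The width of 0.2 is an arbitrary choice.

```r
library(ggplot2)

# Jitter only horizontally so the hwy values stay exact;
# the seed argument makes the noise reproducible for this layer.
p_jitter <- ggplot(mpg, aes(drv, hwy)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 478))

p_jitter
```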

Change the theme

ggplot2 ships with a few built-in themes. One of the cleanest is theme_bw(). The plot below will make you go “WAT”.

ggplot(mpg, aes(cty, hwy)) +
  geom_text(aes(label = model), color = "red") +
  theme_bw()

Using aesthetics in geoms

The following two plots are identical. You can move data and aesthetics into any geom_ layer. The difference is that data and mapping supplied to ggplot() apply automatically to every layer in the plot, whereas data and mapping specified in a geom_ function are limited to that layer and are not passed to any other layer.

# Graph with data and aes in ggplot()

ggplot(data = mpg,
       mapping = aes(x = displ, y=hwy)) +
  geom_point() +
  labs(title = "data and mapping in ggplot")

# The same graph with data and aes in geom_point()
ggplot() +
  geom_point(data = mpg, 
             mapping = aes(x = displ, y=hwy)) +
  labs(title = "data and mapping in geom_point")

As you saw, you can completely move the data and mapping into the geom layer. Is this a good idea in general? (Hint: No)
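One reason to prefer ggplot() for data and mapping shows up as soon as a second layer is added. The sketch below builds the same two-layer plot both ways; p_global and p_local are illustrative names, and the linear smoother is just an example of a second layer.

```r
library(ggplot2)

# Data and mapping declared once; both layers inherit them
p_global <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Data and mapping must be repeated in every layer
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(displ, hwy)) +
  geom_smooth(data = mpg, mapping = aes(displ, hwy),
              method = "lm", se = FALSE)
```

Both objects draw the same picture, but the second version has to restate the data and mapping for each layer, which is easy to get out of sync.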

Also note that data and mapping within a geom override the data and mapping from ggplot(). Here is an example. In the first plot, geom_point() inherits both data and mapping from ggplot(). In the second plot, geom_point() inherits the data but only part of the mapping: it takes x = displ from ggplot() and specifies its own mapping, y = cty. However, the next layer, geom_smooth(), inherits everything from ggplot(). That explains why the smooth line does not pass through the points. More specifically, geom_point() plots cty on the y axis while geom_smooth() plots hwy.

# Plot 1

ggplot(data = mpg, 
       mapping = aes(x = displ, y=hwy)) +
  geom_point() +
  labs(title = "Plot 1")

# Plot 2

ggplot(data = mpg,
       mapping = aes(x = displ, y=hwy)) +
  geom_point(mapping = aes(y = cty)) +
  geom_smooth() +
  labs(title = "Plot 2")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
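If the intent in Plot 2 were to smooth the cty values that the points show, one option is to give geom_smooth() the same layer-level mapping. This is a sketch of that fix, not part of the original example; p_fixed is an illustrative name, and the y label is set by hand because the plot-level mapping still says hwy.

```r
library(ggplot2)

# Both layers now override y with cty, so the line follows the points
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(y = cty)) +
  geom_smooth(mapping = aes(y = cty)) +
  labs(title = "Plot 2 with matching y mappings", y = "cty")

p_fixed
```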

Changing attributes of points

Every aesthetic inside aes() should be mapped to a variable in the data frame. In the following code, we map the class of the cars to color.

ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point(aes(color = class))

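What if we map an aesthetic to a constant instead of a variable?

ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point(aes(color = "white"))

Inside aes(), the constant "white" is treated as a variable with a single level, so ggplot2 picks a color from its default palette and adds a legend for it. If we want to assign a constant color to a geom, we need to pass color as an argument to the geom function, outside aes().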

ggplot(data = mpg, aes(x = displ, y = cty)) +
  geom_point(color = "white")
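The difference between mapping a constant inside aes() and setting it outside aes() can be inspected with layer_data(), which returns the values ggplot2 computed for a layer. The sketch below uses illustrative object names.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg, aes(x = displ, y = cty))

# "white" inside aes() is data: a one-level variable that gets a palette color
p_mapped <- base_plot + geom_point(aes(color = "white"))

# "white" outside aes() is a fixed setting applied to every point
p_set <- base_plot + geom_point(color = "white")

# Inspect the colors ggplot2 actually computed for each layer
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
```

Only the second plot ends up with white points; the first one uses whatever color the default discrete palette assigns to its single level.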

alpha sets the opacity of points, with 0 = completely transparent and 1 = completely opaque. It helps when we have overlapping points. Let’s use alpha = 1/10.

ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 1/10)

alpha provides you with another way to overcome the problem caused by overlapping points.

We can adjust the size of the points too. The default for geom_point() is size = 1.5. Below, we draw black points at size = 8 and overlay semi-transparent white points at size = 7.

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 8, color = "black") +
  geom_point(size = 7, alpha = 0.4, color = "white")

We can map size and alpha to any variable in the data.frame.

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(size = displ, alpha = displ),
             color = "blue") +
  theme_bw()

Using mapping and facets to generate multidimensional visualizations

Map colors to cyl

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = (cyl)))
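Because cyl is stored as an integer, ggplot2 treats it as continuous and colors the points along a gradient. A common alternative, sketched here as an option rather than a correction, is to convert cyl to a factor so that each cylinder count gets its own discrete color.

```r
library(ggplot2)

# factor(cyl) makes the color scale discrete: one color per cylinder count
p_cyl <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = factor(cyl))) +
  labs(color = "cyl")

p_cyl
```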

The above plot shows three dimensions because it maps a third variable, cyl, to point color. Similarly, you can add dimensions by mapping more variables to other aesthetics such as size and alpha. However, as you add dimensions, the graphs become harder to interpret, even though they might look pretty.

Instead, we can make multiple plots and put them side by side. This is called faceting, and ggplot2 implements it through two functions: facet_wrap() and facet_grid().

Consider the following model:

\[ hwy_i = \beta_0 + \beta_1\cdot displ_i + \beta_2 \cdot class_i + \beta_3 \cdot displ_i \times class_i + \epsilon_i \]

(The above equation is written in LaTeX. To see it typeset, you will have to render the document to HTML or PDF.)
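The model above could be fit with lm(), as in the sketch below. Because class is categorical, R expands the β2 and β3 terms into one coefficient per non-reference level of class. The object names fit_additive and fit_interaction are illustrative, and comparing the two fits with anova() is just one way to ask whether the slopes differ.

```r
# Use the data directly from the package namespace
mpg <- ggplot2::mpg

# Additive model: common slope for displ, different intercepts by class
fit_additive <- lm(hwy ~ displ + class, data = mpg)

# Moderation model: displ * class expands to displ + class + displ:class
fit_interaction <- lm(hwy ~ displ * class, data = mpg)

# F-test for whether the interaction terms improve the fit
anova(fit_additive, fit_interaction)
```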

Here, we are using class as a moderating variable. In other words, the strength of the relationship between displ and hwy depends on the value of class. Thus, the slope of the line is not constant; instead, we have a different line for each value of class. In the graph below, we map color to class.

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

What do you think? Is there any indication that the slopes are different for each group? Perhaps we need to use geom_smooth to make it more obvious.

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) 

Another way to visualize such interactions is to use facets. Facets create a separate graph for each level of the faceting variable. Below, we facet on class.

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  facet_wrap(~class) +
  theme_bw()

We can free the scales in each facet using the scales = "free" option. This is not advisable when you are trying to compare the slopes of the lines across facets, because each facet then has its own axis ranges.

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  facet_wrap(~class, scales = "free") +
  theme_bw()
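A middle ground, sketched below, is to free only one axis. With scales = "free_y", every facet keeps the same displ axis while the hwy axis adapts to each class. Slopes are still not directly comparable across facets, since the y ranges differ.

```r
library(ggplot2)

# Shared x axis, independent y axis per facet
p_free_y <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~class, scales = "free_y") +
  theme_bw()

p_free_y
```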

Using facets to contrast group vs individual facet data

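This example is from Hadley’s book. The first layer uses a copy of the data without the class column. Because that layer has no faceting variable, ggplot2 repeats its points in every facet, which gives each panel a grey backdrop of the full data behind the colored points for its own class.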

ggplot(mpg, aes(displ, hwy)) +
  geom_point(data = dplyr::select(mpg, -class),
             color = "grey80", alpha = 0.5) +
  geom_point(aes(color = class), show.legend = FALSE) + 
  theme_bw() +
  facet_wrap(~class) 
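The same plot can be built without dplyr, and layer_data() can confirm that the background layer is repeated in each facet. This is a sketch with illustrative names (mpg_no_class, p_bg), using base R subsetting in place of dplyr::select().

```r
library(ggplot2)

# Copy of mpg without the faceting variable (base R instead of dplyr::select)
mpg_no_class <- mpg[, names(mpg) != "class"]

p_bg <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(data = mpg_no_class, color = "grey80", alpha = 0.5) +
  geom_point(aes(color = class), show.legend = FALSE) +
  facet_wrap(~class) +
  theme_bw()

# The background layer holds one copy of the data per facet
nrow(layer_data(p_bg, 1))

# The foreground layer holds each car exactly once
nrow(layer_data(p_bg, 2))
```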

Histograms with facet_wrap

Note that I am specifying ncol, i.e., number of columns.

ggplot(mpg, aes(displ, fill = drv)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~drv, ncol = 3) +
  theme_bw()

An example with facet_grid

With facet_grid you can create 2-D facets. Suppose, instead of just plotting histograms for drv, you also wanted to see how the distribution changes with cyl.

ggplot(mpg, aes(displ, fill = drv)) +
  geom_histogram(binwidth = 0.5, show.legend = FALSE) +
  facet_grid(cyl ~ drv) +
  theme_bw()
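In a grid like this, the strip labels show only the values of cyl and drv, which can be hard to read. One option, sketched below, is the label_both labeller, which prints the variable name next to each value.

```r
library(ggplot2)

# label_both turns a strip label such as "4" into "cyl: 4"
p_grid <- ggplot(mpg, aes(displ, fill = drv)) +
  geom_histogram(binwidth = 0.5, show.legend = FALSE) +
  facet_grid(cyl ~ drv, labeller = label_both) +
  theme_bw()

p_grid
```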
