CDS 101: Introduction to Computational and Data Sciences

Instructions

Obtain the GitHub repository you will use to complete the lab, which contains a starter file named [lab02.Rmd]{.monospace}. This lab shows you how to use the [ggplot2]{.monospace} package to visualize datasets and demonstrates how visualization plays a crucial role in data exploration. Start by reading Why data visualization? to learn why data visualization is important and why the first lab focuses on it. Next, read the About the dataset section to get some background information on the dataset that you’ll be working with. Continue reading the lab instructions and complete any exercises using the provided spaces within your starter file [lab02.Rmd]{.monospace}. Then, when you’re ready to submit, follow the directions in the How to submit section below.

Why data visualization?

Why is data visualization an important topic? On the face of it, you might wonder why we need to dedicate any time to this topic. Aren’t plots really easy now that we all have computers? Isn’t making plots and figures one of the last things that we do for a project or lab report, after we’ve figured everything out? Why start with this? Since a picture (or visualization) is worth a thousand words, take a moment to explore the data visualizations linked below.

::: {.no-bullets}
* Why do buses bunch?
* U.S. Age Pyramid Becomes a Rectangle
:::

@. After taking a few minutes to explore these visualizations, write at least one full sentence that explains one thing you noticed about either visualization that makes it effective at conveying information.

Visualizations have an important role to play in nearly every stage of a data science project. High-quality visualizations help people to understand your results and can activate their curiosity about your work and ideas. Creating visualizations in R is also easy and fun, and learning how to make them will help you become more comfortable with using R and RStudio. You will quickly see how simple it is to make colorful and eye-catching plots!

About the dataset

You will be exploring the famous dataset by Francis Galton for this week’s lab on data visualization. The data were transcribed by J.A. Hanley [@hanley:2004], who has published them at http://www.medicine.mcgill.ca/epidemiology/hanley/galton/. Francis Galton was developing ways to quantify the heritability of traits in the 1880s, and as part of this work he collected data on the heights of 898 adult children and their parents.

The table below describes the dataset’s 6 variables:

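| Variable | Description |
|----------|-------------|
| [family]{.monospace} | a category with levels for each family |
| [father]{.monospace} | the father’s height (in inches) |
| [mother]{.monospace} | the mother’s height (in inches) |
| [sex]{.monospace} | the child’s sex: [F]{.monospace} or [M]{.monospace} |
| [height]{.monospace} | the child’s height as an adult (in inches) |
| [nkids]{.monospace} | the number of adult children in the family, or, at least, the number whose heights Galton recorded |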

Visualization by example

::: {.callout .primary}
[How do I load the dataset?]{.h4}

The dataset should be automatically loaded into the variable [galton]{.monospace} for you in the R Markdown file for your lab report. To explore the dataset, type the following in your Console window:

```r
View(galton)
```

If running this does not show you a data table, make sure you run the [setup]{.monospace} code block at the top of your R Markdown file first!
:::
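Besides `View()`, a few base R functions print a quick text summary of a data frame in the Console. The sketch below is only an illustration: it uses the `mpg` dataset that ships with [ggplot2]{.monospace} as a stand-in so that it runs on its own, and in your lab report you would put [galton]{.monospace} in its place.

```r
library(ggplot2)  # also provides the example dataset `mpg`

# `mpg` is a stand-in; replace it with `galton` in your lab report
n_rows <- nrow(mpg)   # number of observations (rows)
n_cols <- ncol(mpg)   # number of variables (columns)

names(mpg)            # column names
head(mpg)             # first six rows
summary(mpg$hwy)      # minimum, quartiles, mean, and maximum of one numeric column
```

These commands do not open a table viewer; they print their results directly in the Console.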

Before we discuss the general format for creating [ggplot2]{.monospace} plots, let’s play around with some examples:

@height-histogram. In your lab report, create an R code block that contains the following code:

```r
ggplot(data = galton) +
  geom_histogram(
    mapping = aes(x = height),
    bins = 30
  )
```

To run the code, either click the green "play" button in the upper right corner of the R code block or, while your cursor is inside the code block, press <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>Enter</kbd>.
This should create a plot called a histogram.

After creating the histogram, look at the [height]{.monospace} column in the data table you can view with `View(galton)` and compare it with the histogram.
Then, describe what the histogram is doing with the data in this column.

The input parameter [bins = 30]{.monospace} controls an important visual element in the histogram plot. Let’s experiment with the parameter in order to figure out what it does.

@histogram-bins-binwidth. Using the code you wrote in Exercise @height-histogram as a starting point, try setting the input keyword [bins]{.monospace} equal to something larger than the number 30, and then equal to something smaller than the number 30. This will create two plots. Then, change the input keyword from [bins]{.monospace} to [binwidth]{.monospace} and set its value equal to 1. Compare the plots and write a conclusion about what the [bins]{.monospace} and [binwidth]{.monospace} inputs control.
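Once you have written your own conclusion, one possible way to check it numerically is to ask [ggplot2]{.monospace} for the values it computed for each bar. The sketch below assumes the `mpg` dataset bundled with [ggplot2]{.monospace} as a stand-in for [galton]{.monospace}; the values 10 and 2 are arbitrary choices made for illustration.

```r
library(ggplot2)

# Same data drawn twice: once with `bins`, once with `binwidth`
p_bins <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), bins = 10)

p_width <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)

# layer_data() returns the values ggplot2 computed for a layer,
# with one row per bar
bars_bins <- layer_data(p_bins)
bars_width <- layer_data(p_width)

nrow(bars_bins)                                       # how many bars were drawn
unique(round(bars_width$xmax - bars_width$xmin, 6))   # how wide each bar is
```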

It is simple to add additional arguments to the aesthetic input aes() that change the way data are shown, which can reveal trends that were previously hidden from view. Let’s see what the [fill]{.monospace} argument does to our histogram:

@histogram-fill. Write the following code in your lab report:

```r
ggplot(data = galton) +
  geom_histogram(
    mapping = aes(x = height, fill = sex),
    binwidth = 1,
    alpha = 0.5,
    position = "identity"
  )
```

Run the block and look at the output.
What did adding [fill = sex]{.monospace} as a new input in `aes()` do?
Does this change the way you might interpret the visualization?
What kinds of differences stand out now that we added this?

As you can now see, changing one of the inputs in your [ggplot2]{.monospace} code can have a substantial effect on the way your visualization looks. When a visualization reveals new information, we should describe and interpret it in our lab reports.

@histogram-analysis. Describe the shape of the male and female height distributions and estimate the value each distribution seems to be centered around. Does there appear to be a tangible difference in the average height for these two distributions? Based on what you know about the relative heights of people, is this a result that you would have expected to see? Why or why not?

Based on what you’ve seen so far, do you understand what all of the inputs are doing? The exercise below guides you through the process of exploring how the different inputs affect the plot’s look:

@exploring-inputs. A good way to figure out how R works is to experiment with inputs. What happens if you change the value of [alpha = 0.5]{.monospace} (keep it between 0 and 1)? What happens if you remove the input [position = "identity"]{.monospace}? What happens if you replace it with [position = "dodge"]{.monospace}? What does it change in your output?

**Hint:** When removing the input, be careful with the commas!
A comma should separate each input.
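If you would like to compare several [position]{.monospace} settings without retyping the whole plot, one option is to wrap the plotting code in a small function. This is only a sketch: the function name `hist_with_position` is made up for illustration, and the `mpg` dataset bundled with [ggplot2]{.monospace} stands in for [galton]{.monospace}.

```r
library(ggplot2)

# Hypothetical helper: draws the same histogram with a chosen position
hist_with_position <- function(position) {
  ggplot(data = mpg) +
    geom_histogram(
      mapping = aes(x = hwy, fill = drv),
      binwidth = 2,
      alpha = 0.5,
      position = position
    )
}

p_identity <- hist_with_position("identity")
p_stack <- hist_with_position("stack")   # the setting geom_histogram() uses when position is omitted
p_dodge <- hist_with_position("dodge")

# ymin is the height at which each bar starts
range(layer_data(p_identity)$ymin)
range(layer_data(p_stack)$ymin)
```

Printing `p_identity`, `p_stack`, and `p_dodge` one after another lets you compare the three plots directly.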

The [ggplot2]{.monospace} syntax

Let’s take a small break from making plots and review the general syntax for creating a [ggplot2]{.monospace} figure. The command ggplot(), as you might have figured out already, creates the plot window. Commands with the prefix [geom_]{.monospace}, such as geom_histogram(), convert data points into different kinds of visualizations, and the command aes() controls the aesthetic properties of each [geom_]{.monospace} object. We then specify input parameters inside the parentheses of these commands, with each input separated from neighboring inputs by commas [,]{.monospace} and input names and values separated by the equals sign [=]{.monospace}. In general, visualizations in [ggplot2]{.monospace} have a predictable structure for the commands you need to enter. The general pattern is:

```r
ggplot(data = <data>) +                     # Required in all ggplot2 visualizations
  geom_<function>(
    mapping = aes(<mapping>),               # Required in all ggplot2 visualizations
    stat = <stat>,                          # Optional, sensible default chosen
    position = <position>                   # Optional, sensible default chosen
  ) +
  coord_<function> +                        # Optional, sensible default chosen
  facet_<function> +                        # Optional, sensible default chosen
  scale_<function> +                        # Optional, sensible default chosen
  theme_<function>                          # Optional, sensible default chosen
```

In the above, anywhere you see a word [\<surrounded>]{.monospace} by angle brackets, you can replace it with one of several choices. At a bare minimum, you must always specify the first two functions; all the rest are optional and have sensible defaults chosen for you. Also note that each major category ([geom_]{.monospace} objects, [coord_]{.monospace} objects, [facet_]{.monospace} objects, [scale_]{.monospace} objects, and [theme_]{.monospace} objects) is added to the [ggplot2]{.monospace} “sentence” using the plus sign [+]{.monospace}. A helpful way to think about this is that you are building a figure using a series of layers: first you lay down the canvas ([ggplot]{.monospace}), then you visualize the data on a second layer using a certain style of plot ([geom_\<function>]{.monospace}), and afterward you tweak and polish the plot using additional layers to fine-tune things. This layered approach allows you to create nice figures without much effort or the need to memorize dozens of commands.
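To make the pattern concrete, here is one possible way to fill in every slot. It is a sketch rather than a recipe: it uses the `mpg` dataset bundled with [ggplot2]{.monospace} instead of [galton]{.monospace}, and the particular choices (`facet_wrap()`, `scale_x_continuous()`, `theme_minimal()`) are arbitrary picks from each category. The `stat`, `position`, and `coord_` entries simply spell out the defaults that `geom_point()` would use anyway.

```r
library(ggplot2)

p <- ggplot(data = mpg) +                                # canvas plus dataset
  geom_point(
    mapping = aes(x = displ, y = hwy),                   # which column goes on which axis
    stat = "identity",                                   # geom_point()'s default, written out
    position = "identity"                                # geom_point()'s default, written out
  ) +
  coord_cartesian() +                                    # default coordinate system, written out
  facet_wrap(~ class) +                                  # one panel per vehicle class
  scale_x_continuous(name = "Engine displacement (L)") + # relabel the x-axis
  theme_minimal()                                        # swap the default gray theme

p  # printing the object draws the plot
```

Deleting any line after the closing parenthesis of `geom_point()` still leaves a valid plot, which is what "optional" means in the pattern.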

Scatter-plots

Let’s further explore the data using another type of visualization, the scatter-plot.

@father-child-heights. Use the following code to create a scatter-plot of each person’s height as a function of their father’s height.

```r
ggplot(data = galton) +
  geom_point(
    mapping = aes(x = father, y = height)
  )
```

Here, [height]{.monospace} is the response (dependent) variable and [father]{.monospace} is the explanatory (independent) variable.
Describe any trends that you see using full sentences.

Next, let’s try to create a plot that is similar in spirit to what we did in Exercise @histogram-fill so that we can see how the [height]{.monospace} variable depends on the [father]{.monospace} variable when the [sex]{.monospace} variable is taken into account. One important difference to know is that we need to use the word [color]{.monospace} as an input instead of [fill]{.monospace}. Otherwise, the procedure for grouping over the [sex]{.monospace} variable is basically the same.

@father-child-heights-color. Figure out how to group the data over the [sex]{.monospace} variable using the [color]{.monospace} input and create a new plot. What does this plot tell you about the relationship between the height measurements and the [sex]{.monospace} variable?
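If you get stuck, the sketch below shows the [color]{.monospace} input on a different dataset, so you still need to translate it to [galton]{.monospace} yourself. It assumes the `mpg` dataset bundled with [ggplot2]{.monospace}, where `drv` is a categorical column with three levels.

```r
library(ggplot2)

# Placing color inside aes() maps a column to point color and adds a legend
p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

# layer_data() lists the color assigned to every point
n_colors <- length(unique(layer_data(p_color)$colour))
n_colors  # one color per level of drv
```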

Faceting

Let’s introduce another new concept, the facet. Facets create visualizations with multiple panels, splitting things up across a categorical variable. Let’s apply this to our scatter-plot that we created in Exercise @father-child-heights.

@add-facet. Copy your code from Exercise @father-child-heights that created a scatter-plot. Add the following new command to your code snippet using the [+]{.monospace} sign:

```r
facet_grid(. ~ sex)
```

Describe what you get as output.
Then, create a new code block where you reverse the input for [facet\_grid]{.monospace} like so:

```r
facet_grid(sex ~ .)
```

What did adding [facet\_grid]{.monospace} do to your output, and what does the order of [. ~ sex]{.monospace} versus [sex ~ .]{.monospace} seem to do?
Is the information presented here any different from the information in **Exercise @father-child-heights-color**?
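The same two commands can be tried side by side on another dataset. The following is a minimal sketch that assumes the `mpg` dataset bundled with [ggplot2]{.monospace}, with its three-level `drv` column playing the role of [sex]{.monospace}.

```r
library(ggplot2)

# Base scatter-plot, stored so that each facet layout can be added to it
base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p_dot_first <- base_plot + facet_grid(. ~ drv)   # variable on the right of ~
p_dot_last <- base_plot + facet_grid(drv ~ .)    # variable on the left of ~

# Both versions split the points into the same number of panels;
# only the arrangement of the panels differs
n_panels <- length(unique(layer_data(p_dot_first)$PANEL))
n_panels
```

Storing the unfaceted plot in a variable such as `base_plot` is optional, but it avoids copying the `ggplot()` and `geom_point()` lines each time you try a different layout.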

Modeling in [ggplot2]{.monospace}

You can actually create regression models using [geom_smooth]{.monospace}, which is another handy way to look for trends. You can choose from one of several kinds of regression methods. Here, we’ll use linear regression for creating our models (perhaps you remember drawing one of these “lines of best fit” by hand in a prior science or math class).

@add-smooth. Copy your code from Exercise @father-child-heights that created a scatter-plot. Add the following new command to your code snippet using the [+]{.monospace} sign:

```r
geom_smooth(
  mapping = aes(x = father, y = height),
  method = "lm"
)
```

Describe the line that you get as output.
Does it follow the trends (if any) you've previously described in the data?
What do you think the semi-transparent gray region around the line represents?
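The name `"lm"` refers to R's built-in `lm()` function for fitting linear models, so you can also compute the intercept and slope of a line of best fit yourself. The sketch below assumes the `mpg` dataset bundled with [ggplot2]{.monospace} as a stand-in for [galton]{.monospace}; the variable names `fit` and `p_smooth` are arbitrary.

```r
library(ggplot2)

# Fit a straight line directly: hwy is the response, displ is the explanatory variable
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)  # intercept and slope of the fitted line

# Draw the points and let geom_smooth() fit and draw a line through them
p_smooth <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy), method = "lm")

p_smooth
```

Comparing the sign of the slope printed by `coef(fit)` with the direction of the line in the plot is a quick sanity check on how you read the trend.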

When creating a graph, you often want to make some touch-ups after you’ve figured out what to plot. Do the following to add some extra polish to the plot you made in Exercise @add-smooth.

@add-annotations. Copy your code from Exercise @add-smooth and add the following command to your code:

```r
labs(
  title = "Height of children as adults versus the height of their fathers",
  x = "Father's height (in)",
  y = "Adult child's height (in)"
)
```

You should also alter the size of the scatter-plot circles by adding [size = 2]{.monospace} as an input in [geom\_point]{.monospace}.
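A common stumbling block is where the [size]{.monospace} input goes. Because it is a single fixed value rather than a column of the dataset, it belongs outside `aes()`. The sketch below illustrates the placement using the `mpg` dataset bundled with [ggplot2]{.monospace}; the title and axis labels are made up for that dataset.

```r
library(ggplot2)

p_polished <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    size = 2                       # same value for every point, so it sits outside aes()
  ) +
  labs(
    title = "Highway fuel economy versus engine size",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )

p_polished
```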

Additional exercises

@. Create a similar plot to Exercise @father-child-heights and make a scatter-plot of each person’s height as a function of their mother’s height. Describe any trends that you see using full sentences. If there’s a trend both here and in Exercise @father-child-heights, does it follow the same general direction or does it not? If the trends move in the same direction, then does one trend look stronger than the other? If so, which one? Explain how you know this using full sentences.

@. Start with your code from Exercise @add-facet and add a [geom_smooth]{.monospace} command that separately performs a linear regression on the male and female categories. Each subplot (or facet) should have a line in it. Additionally, polish the graph using what you learned in Exercise @add-annotations. Interpret your results and explain how well these visualizations explain the trends in the dataset, if any.

@. Compare the [geom_smooth]{.monospace} trend-lines for when you used the father’s height only and when you used both the father’s height and whether the child was male or female. Do they both show the same trend? Is one trend more “powerful” than the other? Why or why not? Remember to respond by writing full sentences.

How to submit

To submit your lab, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!

1. Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
2. Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Lab 2 posting on Blackboard.

Cheatsheets

You are encouraged to review and keep the following cheatsheets handy while working on this lab:

Credits

This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The description of the [Galton]{.monospace} dataset was adapted from the documentation of the same dataset that’s available in the [mosaicData]{.monospace} R package and some of the lab exercises were adapted from problem sets found in Modern Data Science with R by Benjamin Baumer, Daniel Kaplan, and Nicholas Horton. All other exercises and instructions written by James Glasbrenner for CDS-102.

Lab 02 – Exploration by visualization: the Galton dataset