Obtain the GitHub repository you will use to complete the lab, which contains a starter file named lab04.Rmd. This lab introduces you to the concepts of tidy data and how you can reshape your dataset to take advantage of the tidyverse tools. Carefully read the lab instructions and complete the exercises using the provided spaces within your starter file lab04.Rmd. Then, when you’re ready to submit, follow the directions in the How to submit section below.
As a review, the principles of Tidy Data are visually represented in the figure below:
The three panels are an illustration of the following three rules,
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
It is worth emphasizing that there is a difference between an untidy dataset and a dirty dataset. Tidying a dataset means reshaping it by transposing the rows and columns until the format matches the criteria outlined in the above rules, which then allows us to more easily use the ggplot2 and dplyr functions to analyze and visualize a dataset. Cleaning a dirty dataset means that you are fixing misspellings, data entry errors, and dealing with other irregularities in the raw data. This lab is about showing you how to reshape a dataset to make it tidy.
About the dataset
This week’s dataset is a gene expression dataset that comes from a real-world research project that tested the effect of starvation and growth rate on baker’s yeast . The following quote, sourced from a discussion about this paper , describes the meaning of this dataset well:
Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA — what we call being “expressed”. (If a gene is never turned into RNA, it may as well not be there at all). This provides a sort of “cellular switchboard” that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell’s activity.
Brauer 2008 used microarrays to test the effect of starvation and growth rate on baker’s yeast S. cerevisiae, a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich media), except that you sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).
“Starving” the yeast of these nutrients lets us find genes that:
- Raise or lower their activity in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress.
- Respond differently when different nutrients are being limited. These genes may be involved in the transport or metabolism of those nutrients.
The table below provides descriptions about the dataset’s 40 variables,
|GID||One kind of ID for each gene|
|YORF||Yeast Open Reading Frame (a kind of sub-unit of genetic information)|
|GWEIGHT||The paper doesn’t make this clear, but all entries are 1|
|[GNP][0.05 ≤ x ≤ 0.30]||The letters G, N, and P represent the restricted nutrient. The decimal value is the yeast growth rate.|
Note that the NAME column contains the following information separated by the double bar || symbols:
|Gene name||for example, SFB2. Not all genes have a name here.|
|Biological process||for example, “proteolysis and peptidolysis”|
|Molecular function||for example, “metalloendopeptidase activity”|
|Systematic ID||for example, YNL049C. Every gene has one of these unique IDs.|
|Unknown ID Number||for example, 1082129. The paper doesn’t explain what these mean.|
Like in previous labs, it’s recommended that you take a first look at the dataset by viewing it by running
View(brauer) in your Console window.
The tidyr package
Reshaping the gene expression dataset will require us to use two functions found in the tidyr package,
separate(). Let’s review how each of these functions works with the extended example from Chapter 12.6 in the R for Data Science textbook.
library(tidyverse) command at the top of our RMarkdown file loads many packages and example datasets for us, which includes a dataset from the World Health Organization that is stored in the variable
who. The first few lines of the
who dataset are:
Using gather() to transpose columns into rows
In this data frame, the names of the 57 columns starting with new_sp_m014 and ending with newrel_f65 each refer to a set of three categories, violating the first rule for tidy data. This can easily be fixed by transposing these columns into rows using the
who1 <- who %>% gather( new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE )
After applying the
gather() operation, the first few rows in the dataset now look as follows:
As you can see, we’ve taken the 57 category columnms and converted them into categories underneath a single column named key with their corresponding values placed underneath the column cases.
To summarize, the syntax for
gather() is as follows:
dataset %>% gather( ..., # Columns you want to gather into rows key = "...", # Variable for storing names of gathered columns value = "..." # Variable for values stored under gathered columns )
Using separate() to split one column into many
The other function you will need to use is
separate(). This function takes values in a single column and splits them out into multiple columns. This is used when you have a dataset that doesn’t follow the third rule of tidy data.
Going back to the who example, after you gathered the 57 columns into a single column, we might wonder what the values under key mean. If you run
?who, you’ll see that the underscores … separate different variable values. This means that the key column contains 3 values per cell, not one, so this is a case for using
To separate the columns, we run the following:
who2 <- who1 %>% mutate( key = str_replace(key, "newrel", "new_rel") ) %>% separate( col = key, into = combine("new", "type", "sexage"), sep = "_" )
You can ignore the second line with a
mutate() command for now, as this just fixes the missing underscore that you need to do before running separate.
The first few lines of the separated dataset are:
As you can see, this has successfully split our one column into three.
To summarize, the syntax for
separate() is as follows:
dataset %>% separate( col = ..., # Name of column to separate into = combine(...), # Names for new columns formed after separation sep = ..., # Specifies the separator symbols convert = ... # If TRUE, tries to set data type for new columns )
Your should note that the into keyword needs to be specified as a vector of strings that you can create using
Prune the dataset
When dealing with larger datasets that have many columns, it is helpful if you can identify any columns that are extraneous and will not be used for analysis. This shrinks the overall size of the dataset so that it takes up less space in the computer’s memory. For this dataset, the GID, YORF, and GWEIGHT columns aren’t particularly important for any kind of analysis, so they can be removed.
- Remove the GID, YORF, and GWEIGHT variables from the dataset and assign the result to the variable brauer2.
Enforcing one observation per row
Now we will employ a systematic approach while tidying this dataset by going through the rules one by one and determining what (if anything) needs to be reshaped. We start with the first rule that “each variable must have its own column.” The columns G0.05 through U0.3 each contain the results of different experimental trials, telling us which nutrient was restricted and what growth rate was selected for the yeast. Knowing this, we conclude that this dataset violates the first rule, as each row contains multiple experimental trials.
gather()to transpose the columns G0.05 through U0.3 into a series of rows to satisfy the rule that “each variable must have its own column. Use the label sample for the new categorical variable created by
gather()that contains the gathered column names and use the label expression for the new numerical variable containing the values originally under the G0.05 through U0.3 columns. Assign your result to the variable brauer3.
After completing this step, the one observation per row rule will be satisfied.
Enforcing one column per variable and one value per cell
The second and third rules are intertwined, and so fixing one will help to satisfy the other. Looking at the version of the dataset in brauer3, we can clearly see that the one value per cell rule is violated by the NAME and sample columns. These are violations that can be fixed using the
separate() function. Let’s separate the sample column we created with the
gather() function first.
If you don’t want the
separate() function to delete any individual letters or symbols when splitting a column, you can set the sep argument equal to an integer, for example sep = 1. This tells
separate() to split the column right after the first letter.
- Separate the sample column that you created in Exercise 2 into two variables, nutrient and rate. After separation, the data type for the rate column should be double (<dbl>), not character. Assign your result to the variable brauer4.
Now we turn to the NAME column where each value is separated by two bars || and there are five variables contained in each cell. Using the information from the about the dataset section, we deduce that we should name the new columns gene_name, biological_process, molecular_function, systematic_id, and number.
If you try to use sep = “||” in
separate(), it will not work as expected. By default, the bar symbol | is interpreted as the boolean operator OR, which should be familiar from creating rules for
filter(). To get around this, we can preceed | with backslashes to signal to R that we want it to read | as a symbol and not as the boolean operator OR. Thus, we should use sep = “\\|\\|” to define our separator.
separate()to split the values in the NAME column into five columns named gene_name, biological_process, molecular_function, systematic_id, and number. Assign your result to the variable brauer5.
We’re nearly done! If you inspect the dataset with
View(brauer5), you’ll notice that the text under some of the columns has quotation marks with unnecessary blank space. For example, ” molecular function unknown “. Removing this blank space (typically called “white space”) is called trimming, and we can use the
str_trim() function for this purpose.
Remove the whitespace from the five new columns you created using
mutate_at()in combination with
brauer5 %>% mutate_at( vars(gene_name:systematic_id), str_trim )
Assign your result to the variable brauer_tidy.
Now we’re done!
Visualizations using the tidy dataset
Having a tidy dataset makes it easy to use the rest of the tidyverse packages, such as ggplot2. To demonstrate this, we will now create a plot that a biologist would use to explore gene expression trends that emerge due to restricting different nutrients and enforcing different growth rates.
- Filter brauer_tidy so that it only contains entries with gene_name equal to LEU1. Then, create a line plot where you place rate on the horizontal axis, expression on the vertical axis, and assign different colors to nutrient.
What are the dataset’s dimensions (the number of rows and columns) after completing the tidying procedure? Do you find it easier to read the data in this format, or did you think the original format was easier to understand? Why or why not?
In Exercise 6 we were able to easily create a line plot of expression as a function of rate for the different kinds of nutrients for one of the genes in the microarray. What specifically did the tidying procedure do to make it simple for us to create this plot? Could we still create the same plot with ggplot2 using the untidy version of the dataset? If so, try to write the ggplot2 code for it. If not, identify what it is about the untidy dataset that prevents you from creating the same plot.
How to submit
To submit your lab, follow the two steps below. Your lab will be graded for credit after you’ve completed both steps!
Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.
Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to Lab 4 posting on Blackboard.
You are encouraged to review and keep the following cheatsheets handy while working on this lab:
This lab is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Exercises and instructions written by James Glasbrenner for CDS-102.
 M. J. Brauer, C. Huttenhower, E. M. Airoldi, R. Rosenstein, J. C. Matese, D. Gresham, V. M. Boer, O. G. Troyanskaya, and D. Botstein, “Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast,” Molecular Biology of the Cell 19, 352 (2007).
 David Robinson, “Cleaning and Visualizing Genomic Data: A Case Study in Tidy Analysis,” Variance Explained (2015).