# Mini-homework 4 – Flights of New York

## Instructions

Obtain the GitHub repository you will use to complete the mini-homework, which contains a starter file named flights_of_new_york.Rmd. This mini-homework will help you become more familiar with manipulating datasets using R by guiding you through some examples. Read the About the dataset section to get some background information on the dataset that you’ll be working with. Each of the below exercises are to be completed in the provided spaces within your starter file flights_of_new_york.Rmd. Then, when you’re ready to submit, follow the directions in the **How to submit** section below.

## About the dataset

The dataset we are working with in this mini-homework is the flights dataset that is loaded via the nycflights13 R package. This dataset contains on-time data for all flights that departed from a New York City airport (airport codes: JFK, LGA, EWR) in 2013.

## Exercises

Start working on these exercises **after** you have successfully cloned your repository into RStudio Server.

As always, you should inspect a new dataset to become more familiar with what it contains. To do this in RStudio, type the following command

**in your**(not in the R Markdown file):*Console*window`View(flights)`

Since the flights dataset is loaded as an R package via the

`library(nycflights13)`

in your setup code block, additional documentation is available. To read more about the dataset, type the following command**in your**:*Console*window`?flights`

Use these resources to answer these initial questions.

How many rows and columns does this dataset have?

What does a single row in this dataset represent?

What is the difference between the information contained in the dep_time and sched_dep_time columns?

Which columns contain information about dates and times?

Airplanes are reused across many different flights. Which columns would be helpful to use in identifying individual airplanes?

Let’s start with the

`select()`

function. The`select()`

function*selects*columns from a dataset, which is useful when you’re working with a dataset that contains dozens of variables. Try running the following code:`flights %>% select(year, month)`

Based on the output, explain what happens when you run this command.

What does %>% mean?

The strange looking symbol %>% is called the

**pipe**, and it is a handy way to pass a dataset through a chain of commands. All examples going forward will use the pipe whenever possible.There are multiple ways to specify columns with the

`select()`

command. To demonstrate this, copy the code you wrote in**Exercise 2**and replace`select(year, month)`

with`select(year:day)`

. What does the colon : do?The sort operation is a common and indispensible operation for organizing data, and the function from dplyr that allows us to sort is called

`arrange()`

.`arrange()`

sorts columns with textual data (chr data type) into alphabetical order and sorts numerical data into numerical order.Try running the following code:

`flights %>% arrange(month, day)`

Based on the output, does it look like both the month and day columns were sorted? Which column was sorted first? What happens if you reverse the order you specify the columns in

`arrange()`

?Important!

Students often find that there is a

*base R*function named`sort()`

as well, leading to confusion later in the course. Using`sort()`

the same way you use`arrange()`

will give you errors. For the duration of course**we will always prefer to use the tidyverse version of a function**, so always use`arrange()`

and pretend`sort()`

doesn’t exist.By default,

`arrange()`

will sort data in ascending order. The function`desc()`

can be used to sort in descending order. For example, to sort by months in reverse order, we would use`arrange(desc(month))`

Let’s use it to answer a simple question about the dataset: what flight experienced the longest departure delay? Identify the column that gives information about flight departure delays, and then use`arrange()`

in combination with`desc()`

to sort the flights dataset to find the flight with the longest departure delay.Let’s now try an example that uses the

`mutate()`

function, which is a little more complex.`mutate()`

lets us**transform**a dataset by applying the same operation to each row in the dataset and appending the results as a*new*column. More simply, this is how you can add, subtract, multiply, and divide different two or more columns against each other. As an example, run the following command to calculate the average speed for each flight in miles per hour:`flights %>% mutate( average_speed = distance / (air_time * 60) )`

Where does the new column you just computed show up in the dataset and what is the name of this new column? What part of the code is controlling the name of the new column?

We are not limited to only one input at a time in

`mutate()`

. As long as we separate each input by a comma, we can put as many inputs as we want in the`mutate()`

function! As an example, let’s take the`dep_time`

column, which gives the clock time as an integer (for example, 11:00am is 1100) and separate it into an hours column and minutes column:`flights %>% mutate( dep_time_hour = dep_time %/% 100, dep_time_minute = dep_time %% 100 )`

Next, copy and paste this code into a new code block, add a second comma, and then add a third line.

**Have this third line use the**Name this third column`dep_time_hour`

and`dep_time_minute`

columns to compute the number of minutes since midnight.`dep_time_minutes_midnight`

.What do %/% and %% mean?

The operators %/% and %% let you do

**modular arithmetic.**The symbol %/% means*integer division*and %% means*remainder*. As a quick example, you might remember learning about proper and improper fractions in math class. \(\dfrac{9}{4}\) is an example of an improper fraction. To convert it into a proper fraction, first we need to figure out how many times 4 goes into 9. We can easily compute this with integer division:`9 %/% 4`

`# [1] 2`

We also need to know the remainder, or how much is left over after dividing 4 into 9.

`9 %% 4`

`# [1] 1`

So the proper fraction representation of \(\dfrac{9}{4}\) is \(2 \frac{1}{4}\).

Next we consider the

`filter()`

function, which provides a ruled-based way to keep a subset of rows and remove the rest. Here we just consider rules that are simple comparisons, which involve the symbols:- >: greater than
- >=: greater than or equal to
- <: less than
- <=: less than or equal to
- !=: not equal
- ==: equal

Give

`filter()`

a try by running the following two code blocks:`flights %>% filter( arr_delay < 0 )`

`flights %>% filter( carrier == "UA" )`

After running the above code, figure out how to combine the two examples in one

`filter()`

function (**hint:**it resembles what you did with`mutate()`

). This will tell you all the flights operated by United Airlines (UA) that arrived early.It is common to want to summarize the information contained within a dataset, such as computing sums and averages, or counting how many data points belong to different groups. This is called data aggregation, as it

**aggregates**many data points together and uses them to compute a cetain quantity. We perform data aggregation in R by using the commands`group_by()`

and`summarize()`

, which frequently show up as a pair.The

`group_by()`

command is applied to one or more columns, and allows you create groups that share common values in a column of categorical data. The`summarize()`

command can then be used to compute quantities like averages**within**those groups. It will help to see an example. Run the following code, which calculates the mean (average) arrival delay for each airline carrier:`flights %>% group_by(carrier) %>% summarize( average_arr_delay = mean(arr_delay, na.rm = TRUE) )`

Which airline carrier had the longest arrival delays on average? Which airline carrier had the shortest arrival delays on average?

## How to submit

To submit your mini-homework, follow the two steps below. Your homework will be graded for credit **after** you’ve completed both steps!

Save, commit, and push your completed R Markdown file so that everything is synchronized to GitHub. If you do this right, then you will be able to view your completed file on the GitHub website.

Knit your R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to

*Mini-homework 4*posting on Blackboard.

## Cheatsheets

You are encouraged to review and keep the following cheatsheets handy while working on this mini-homework:

## Credits

**nycflights13 data set source:** Hadley Wickham. 2018. *nycflights13*: Flights that Departed NYC in 2013. https://cran.r-project.org/web/packages/nycflights13/index.html

The example in Exercise 6 and the example and question of Exercise 7 sourced from Chapter 5 of *R for Data Science* by Garrett Grolemund and Hadley Wickham, used under CC BY-NC-ND 3.0.