# Homework 4

## Instructions

The first part of this page explains how homework assignments will be handled and evaluated, since they are completed in groups. The questions for Homework 4 start further down the page, after these instructions.

### Overview

As a group, solve the homework problems and write your answers in the R Markdown file homework_4.Rmd. In addition to correctness, grades for the group submissions will be based on document formatting, visualization quality, writing quality, and code style. Your group submission should therefore be written in the style of an exploratory data report:

• Each exercise must be written up using full sentences such that it is clear what question is being answered.

• There needs to be plain text above each code block explaining what you are doing, and the code blocks should be organized.

• The R Markdown file must knit without error and generate a PDF file, and the final PDF output must look clean and be easy to read.

### Participation

Credit for group participation will be determined using the following sources:

1. A CONTRIBUTIONS.md file distributed with your group repository

2. Commit history on GitHub

3. Discussion history in your group’s private Slack channel

Each group will need to fill out the CONTRIBUTIONS.md file as part of their submission. This file is where each group member lists what he or she contributed to the final submission. Read the section Fill out the CONTRIBUTIONS.md file for more details on how this works.

If your group used an external document to coordinate and organize your work, such as a Google Doc, then that can also count as evidence of participation, provided that there is a visible writing history and it is possible to identify which student is responsible for each edit. This will require you to share the relevant file with the instructor with full privileges on the document so that it’s possible to review the document’s editing history. Please note that anonymous edits to Google Docs documents cannot be used as participation evidence, since there is no way to verify the account responsible for the added content. Also, for similar reasons, offline documents traded back and forth via email cannot be accepted as evidence of participation.

### How to answer the questions as a group

•   Read through all the problems individually. Then, as a group, discuss what will be needed to fully answer each question.

•   As a group, decide how you will split up writing responsibilities. A typical way to do this is to have each group member be responsible for writing out the full answer to a certain number of questions.

•   (Important) Before you start, make a copy of homework_4.Rmd and rename the copied file to include your last name. For example, if your last name is Smith, then your file copy should be renamed to homework_4_smith.Rmd.

•   Commit and push your copied file to GitHub.

•   Draft your contributions in your file. For example, if my last name were Smith and I agreed to write up the answers to questions 4, 5, and 6, then I would open up homework_4_smith.Rmd and put my answers there. When I’m done, I would save my file, then commit and push my work to GitHub.

### How to edit and merge your answers into the group submission

While you will be writing your answers in separate files, eventually the group will need to merge everyone’s answers into the main homework_4.Rmd document. The following checklist may help with this:

•   Select an editor to be in charge of merging everyone’s answers into the final document homework_4.Rmd. Because the editor needs to prepare the document for submission, it is reasonable if he or she contributes slightly less in terms of answering the questions (for example, if everyone else answers three questions, it would be okay if the editor answers two).

•   The editor should ensure that everyone has committed and pushed their answers to GitHub so they can copy and paste in each contribution.

•   The editor needs to make sure that the final submission reads like a coherent document and that the writing style and code style are uniform across all the answers. In other words, it should read like a single person answered all the questions, not a group of four people.

•   The editor will be in charge of committing and pushing the final R Markdown file to GitHub, knitting to PDF, and uploading the PDF file on Blackboard.

### Fill out the CONTRIBUTIONS.md file

After everything is written up and ready for submission, the last thing the group will need to do is fill out the CONTRIBUTIONS.md file. By default, the file looks like this:

    # Contributions to group submission

    ## Editor: FirstName LastName Member 1

    ## FirstName LastName Member 2

    ## FirstName LastName Member 3

    ## FirstName LastName Member 4

    *   Questions answered:

At a minimum, you must remove the FirstName LastName Member entries in the template and fill in the names of the people in your group, indicate which group member served as the editor, and state which questions were written up by each member.

Additional information beyond this should be supplied, such as indicating when a group member helped another group member edit an answer or gave helpful comments in a Slack discussion. For example, one group member’s contribution list may read as follows:

    ## Jane Smith

    *   Questions answered: 4, 5, 6
    *   Helped with editing on answers 8 and 9
    *   Collaborated with group member Jack Williams on answering question 10
    *   Pointed out spelling errors and suggested fixes to the document layout in the merged group document

### Working with a GitHub repository as a group

You will likely encounter some issues while working in a group-based GitHub repository. In particular, you might find that clicking “Push” in the Git tab of RStudio fails with an error message. This happens when another member of your group has pushed work that you have not yet pulled. While this can be irritating to deal with, it is actually a good thing, as it is GitHub’s way of protecting your files from accidental overwrites and deletions.

So what should you do to keep things running smoothly? First, always work in your own file, never in another person’s file. If you are not the editor, then you should not edit homework_4.Rmd either! Also, do not remove or rename any files that are not your own. Finally, when you are getting ready to work, following the procedure below should help keep the error messages to a minimum:

1. When you start working, you should begin by going to the Git tab and clicking “Pull” (notice this is not the same as “Push”). This will synchronize any new changes that your group may have made into your files.

2. Work on your file as normal. When you have completed your work, save your file.

3. Now you are ready to commit. First, go to the Git tab and click “Pull” one more time to check for any other changes. Then, still in the Git tab, click the checkbox next to your updated file, type a message in the message box, and click the Commit button.

4. If the updated file is no longer in the list of files in the Git tab, then your commit was successful. Click “Push” to upload the commit to GitHub.

#### If the above advice doesn’t work…

If, even after following the advice above, you still encounter error messages when pulling from and pushing to GitHub, contact the course instructor for help.

### How to submit

The editor should follow the steps below to submit the homework for his/her group.

1. Make sure that everyone has committed and pushed their R Markdown files so that everything is synchronized to GitHub. If you do this right, then you will be able to view all the completed files on the GitHub website.

2. Knit your group’s R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Homework 4 posting on Blackboard.

## Market value of condominiums in New York City dataset

This dataset reports on the market valuations of condominiums in New York City for Fiscal Year 2011/2012. The data come from the New York City Department of Finance and were made available to the general public on the NYC OpenData website (https://opendata.cityofnewyork.us/). A cleaned and aggregated version was subsequently made available by data scientist Jared Lander (https://www.jaredlander.com/data). The official description for the dataset is as follows:

Condominiums and cooperatives are valued as if they were residential rental apartments. Income information from similar rental properties is applied to determine value. The Department of Finance (DOF) chooses similar properties to value condos and coops. Properties are selected based on a combination of factors such as: land location, income levels, building age and construction and exemptions and subsidies.

The driving question for our analysis is as follows:

What are key factors that affect the overall price of condominiums in New York City?

Thus, your main goal is to build a linear regression model that predicts the market value per square foot (the variable value_per_sqft) of condominiums in New York City. Note that you are not expected to find a regression model with 100% precision; instead, our interest is in using models to uncover trends in the data.

When building and evaluating predictive models, it is standard protocol to split your dataset into a training dataset and a test dataset. This has already been done for you, with the training dataset loaded into the variable housing_train and the testing dataset loaded into housing_test. You will be using housing_train for most of the homework to build and cross-validate your models. Once you’ve selected your model, as a final step you will use it to predict the value_per_sqft column in the dataset stored in housing_test.

The dataset contains the 7 variables listed below.

| Variable | Description |
| --- | --- |
| boro | Borough where building is located. New York City is divided into 5 boroughs: Manhattan, The Bronx, Brooklyn, Queens, and Staten Island. |
| neighborhood | Neighborhood of building location. The neighborhood name is assigned by the New York City Department of Finance, and in most cases is the same as the neighborhood’s common name. |
| class | Building classification code assigned by the New York City Department of Finance. There are four building classifications for the condominiums in the dataset: rental, walk-up, elevator, and co-op. |
| year_build | Year the building was built |
| units | Total number of units in the building |
| sqft | Gross square footage of the building |
| value_per_sqft | Total market value per square foot of the land and building |

## Cross-validation helper functions

For your convenience, the helper function rep_kfold_cv(data, k, model, cv_reps) is loaded into your R environment and will run the code that cross-validates your models. This function requires four inputs:

| Input | Description |
| --- | --- |
| data | The training dataset |
| k | The number of folds to use for cross-validation |
| model | Model to cross-validate, written in the lm() formula syntax (for example, money ~ work + time) |
| cv_reps | Number of times to repeat the cross-validation sequence to improve statistical averaging |
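
To make the four inputs concrete, here is a minimal base R sketch of repeated k-fold cross-validation. It is not the course's rep_kfold_cv(): the name rep_kfold_cv_sketch, the made-up toy_housing data, and the choice to return only a single averaged MSE are all assumptions for illustration, and the real helper may report its results differently.

```r
# Hypothetical stand-in for rep_kfold_cv(); the real helper may return
# different quantities and is implemented with tidyverse packages.
rep_kfold_cv_sketch <- function(data, k, model, cv_reps) {
  response <- all.vars(model)[1]  # name of the response variable
  rep_mse <- numeric(cv_reps)
  for (r in seq_len(cv_reps)) {
    # Randomly assign each row to one of k folds
    fold <- sample(rep(seq_len(k), length.out = nrow(data)))
    fold_mse <- numeric(k)
    for (i in seq_len(k)) {
      # Fit on all folds except fold i, then predict the held-out fold
      fit <- lm(model, data = data[fold != i, ])
      held_out <- data[fold == i, ]
      pred <- predict(fit, newdata = held_out)
      fold_mse[i] <- mean((held_out[[response]] - pred)^2)
    }
    rep_mse[r] <- mean(fold_mse)
  }
  # Average the per-repetition MSE values
  mean(rep_mse)
}

# Made-up data that reuses two column names from the homework dataset
set.seed(1)
toy_housing <- data.frame(
  boro = factor(sample(c("Manhattan", "Brooklyn", "Queens"), 200, replace = TRUE))
)
toy_housing$value_per_sqft <-
  80 + 60 * (toy_housing$boro == "Manhattan") + rnorm(200, sd = 15)

cv_mse <- rep_kfold_cv_sketch(
  toy_housing, k = 10, model = value_per_sqft ~ boro, cv_reps = 3
)
cv_mse
```

Repeating the whole k-fold sequence cv_reps times with fresh random folds reduces the dependence of the averaged score on any single random split.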

## Questions

1. Create the following visualizations to explore the dataset:

• A histogram of value_per_sqft faceted over boroughs of New York City

• Box plots of units (y axis) for the different boroughs (x axis) plotted two different ways: in a normal scale and in a log10() scale along the y axis (see http://r4ds.had.co.nz/graphics-for-communication.html#replacing-a-scale for how to scale the axes)

• Box plots of sqft (y axis) for the different boroughs (x axis) plotted two different ways: in a normal scale and in a log10() scale along the y axis

• Scatterplots of value_per_sqft (y axis) versus units (x axis) using log10() scaling for units. Facet over two variables: boroughs and condominium classification.

Based on your plots so far, which variables (columns) in the dataset seem to have the strongest overall impact on the condominium values?
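
As a rough illustration of the log10() scaling and two-variable faceting mentioned above, the sketch below uses ggplot2 on a small made-up data frame. The toy_housing data is an assumption that only borrows the column names from the homework dataset; it is not the real data and the plots are not a model answer.

```r
library(ggplot2)

# Made-up data; only the column names match the homework dataset
set.seed(1)
toy_housing <- data.frame(
  boro = sample(c("Manhattan", "Brooklyn", "Queens"), 150, replace = TRUE),
  class = sample(c("rental", "elevator"), 150, replace = TRUE),
  units = round(10^runif(150, 1, 3))
)
toy_housing$value_per_sqft <-
  50 + 20 * log10(toy_housing$units) + rnorm(150, sd = 10)

# Box plot of units per borough with a log10-scaled y axis
units_boxplot <- ggplot(toy_housing, aes(x = boro, y = units)) +
  geom_boxplot() +
  scale_y_log10()

# Scatterplot with a log10-scaled x axis, faceted over two variables
value_scatter <- ggplot(toy_housing, aes(x = units, y = value_per_sqft)) +
  geom_point() +
  scale_x_log10() +
  facet_grid(class ~ boro)

units_boxplot
value_scatter
```

Dropping the scale_y_log10() layer gives the normal-scale version of the same box plot.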

2. The box plots of units in the previous question should reveal extreme outliers in the plot. Since our goal is to model general trends and not precise values, fitting to these data points may skew our model in an unhelpful way. Filter the dataset to remove these outliers (there are 3 in all).
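
One possible way to drop rows above a threshold is dplyr's filter(). In the sketch below, the data frame and the cutoff value units_cutoff are both hypothetical and chosen only for the toy data; the appropriate cutoff for the real dataset has to be read off your own box plots.

```r
library(dplyr)

# Made-up data with one obviously extreme value of units
toy_housing <- data.frame(
  boro = c("Manhattan", "Brooklyn", "Queens", "Manhattan", "Bronx"),
  units = c(42, 118, 65, 2500, 90)
)

# Hypothetical cutoff that applies to the toy data only
units_cutoff <- 1000

toy_filtered <- toy_housing %>%
  filter(units < units_cutoff)

# Number of rows removed by the filter
rows_dropped <- nrow(toy_housing) - nrow(toy_filtered)
rows_dropped
```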

3. Build 4 different univariate (single variable) linear regression models using value_per_sqft as your response variable and either boro, class, units, or sqft as your explanatory variable. Plug these models into the k-fold cross-validation function rep_kfold_cv() and use k = 10 and cv_reps = 3 as your other inputs. Compare the mean-square error (MSE), both unadjusted and adjusted, and $R^2$ for these different models, noting that models with better predictive power will have lower MSE and higher $R^2$ scores. Which model did the best so far?

4. Build and cross-validate at least 3 multivariate models that predict value_per_sqft, using the k-fold cross-validation parameters k = 10 and cv_reps = 3. An example of a multivariate model is value_per_sqft ~ boro + units. You may also want to consider including interaction terms (see http://r4ds.had.co.nz/model-basics.html#interactions-continuous-and-categorical for a quick review). For example, you might try value_per_sqft ~ boro + class * sqft. Which of your models performs the best? Is it significantly better than your best model in the last question?
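
To see what the `*` operator in a model formula adds, base R can list the terms a formula expands to without fitting anything. The variable names additive and with_interaction below are only labels for this sketch.

```r
# The two example formulas from the question above
additive <- value_per_sqft ~ boro + units
with_interaction <- value_per_sqft ~ boro + class * sqft

# List the terms each formula expands to
additive_terms <- attr(terms(additive), "term.labels")
interaction_terms <- attr(terms(with_interaction), "term.labels")

additive_terms     # boro, units
interaction_terms  # boro, class, sqft, class:sqft
```

In other words, `class * sqft` is shorthand for `class + sqft + class:sqft`, where `class:sqft` is the interaction term.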

5. Now that you’ve selected your model, train it on the full dataset:

    final_model <- lm(model_formula, data = housing_train)

where model_formula is the best performing model from the previous question. Use final_model to calculate the mean-square error for predictions on the testing dataset:

    final_model %>%
      mse(housing_test)

This score is useful because it is absolute and allows you to compare how well your model performs against all other model types. Can you do better than an MSE score of 2030.34?
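
For reference, the test-set mean-square error is the average squared difference between the observed and predicted values. The base R sketch below computes it by hand on made-up data. The toy data, the train/test split, and the model are all assumptions for illustration, and the result is unrelated to the 2030.34 score above.

```r
# Made-up data split into hypothetical training and test sets
set.seed(42)
n <- 120
toy <- data.frame(sqft = runif(n, 20000, 200000))
toy$value_per_sqft <- 60 + 0.0005 * toy$sqft + rnorm(n, sd = 12)

train_rows <- sample(n, 90)
toy_train <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

# Fit on the training rows only
toy_model <- lm(value_per_sqft ~ sqft, data = toy_train)

# Predict the held-out rows and average the squared errors
test_pred <- predict(toy_model, newdata = toy_test)
test_mse <- mean((toy_test$value_per_sqft - test_pred)^2)
test_mse
```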

6. To wrap up, evaluate how well your model obeys the conditions for least squares linear regression, which are summarized on page 238 of the Introductory Statistics with Randomization and Simulation textbook. Make two plots to inspect how well your model conforms to the requirements for linear regression:

• To evaluate the residual spread, make a residual (y axis) versus predicted (x axis) scatterplot.

• To evaluate whether the residual distribution is nearly normal, make a residual Q-Q plot.

Explain whether your model obeys the conditions for least squares linear regression.
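
The two diagnostic plots can be sketched with ggplot2 as follows. The data and model here are made up for illustration, so the resulting plots say nothing about the homework dataset.

```r
library(ggplot2)

# Made-up data and a simple model fitted to it
set.seed(7)
toy <- data.frame(sqft = runif(100, 20000, 200000))
toy$value_per_sqft <- 60 + 0.0005 * toy$sqft + rnorm(100, sd = 12)
toy_model <- lm(value_per_sqft ~ sqft, data = toy)

# Collect predicted values and residuals in one data frame
diagnostics <- data.frame(
  predicted = fitted(toy_model),
  residual = resid(toy_model)
)

# Residual versus predicted: look for even spread around zero
resid_plot <- ggplot(diagnostics, aes(x = predicted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

# Q-Q plot: points close to the line suggest nearly normal residuals
qq_plot <- ggplot(diagnostics, aes(sample = residual)) +
  geom_qq() +
  geom_qq_line()

resid_plot
qq_plot
```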

## Cheatsheets

You are encouraged to review and keep the following cheatsheets handy while working on this assignment:

1. For those who are interested in seeing how k-fold cross-validation can be implemented using the tidyverse packages, the code for the function rep_kfold_cv() can be found in the file repeated_kfold_cross_validation.R distributed with your GitHub repo.