CDS 101: Introduction to Computational and Data Sciences

Instructions

The first part of this page explains how homework assignments will be handled and evaluated, since they are completed in groups. The questions for Homework 4 start further down, click this link to skip to that part of the page.

Overview

As a group, solve the homework problems and write your answers in the R Markdown file [homework_4.Rmd]{.monospace}. Grades for the group submissions will, in addition to correctness, be based on document formatting, visualization quality, writing quality, and code style. This means that your group submission is to be written in the style of a exploratory data report, meaning:

Each exercise must be written up using full sentences such that it is clear what question is being answered.
There needs to be plain text above each code block explaining what you are doing, and the code blocks should be organized.
The R Markdown file must knit without error and generate a PDF file, and the final PDF output must look nice, clean, and be easy to read.

Participation

Credit for group participation will be determined using the following sources:

A [CONTRIBUTIONS.md]{.monospace} file distributed with your group repository
Commit history on GitHub
Discussion history in your group’s private Slack channel

Each group will need to fill out the [CONTRIBUTIONS.md]{.monospace} file as part of their submission. This file is where where each group member lists what he or she contributed to the final submission. Read the section Fill out the [CONTRIBUTIONS.md]{.monospace} file for more details on how this works.

Google Docs

If your group used an external document to coordinate and organize your work, such as a Google Doc, then that can also count as evidence of participation, provided that there is a visible writing history and it is possible to identify which student is responsible for each edit. This will require you to share the relevant file with the instructor with full privileges on the document so that it’s possible to review the document’s editing history. Please note that anonymous edits to Google Docs documents cannot be used as participation evidence, since there is no way to verify the account responsible for the added content. Also, for similar reasons, offline documents traded back and forth via email cannot be accepted as evidence of participation.

How to answer the questions as a group

The following is a checklist you may follow to help you get started with answering the questions as a group:

::: {.no-bullets} * Read through all the problems individually. Then, as a group, discuss what will be needed to fully answer each question.

As a group, decide how you will split up writing responsibilities. A typical way to do this is to have each group member be responsible for writing out the full answer to a certain number of questions.
(Important) Before you start, make a copy of [homework_4.Rmd]{.monospace} and rename the copied file to include your last name. For example, if your last name is Smith, then your file copy should be renamed to [homework_4_smith.Rmd]{.monospace}.
Commit and push your copied file to GitHub.
Draft your contributions in your file. For example, if my last name was Smith and I agreed to write-up the answers to questions 4, 5, and 6, then I would open up [homework_4_smith.Rmd]{.monospace} and put my answers there. When I’m done, I would save my file, then commit and push my work to GitHub. :::

How to edit and merge your answers into the group submission

While you will be writing your answers in separate files, eventually the group will need to merge in everyone’s answers into the main [homework_4.Rmd]{.monospace} document. The following checklist may help with this:

Select an editor to be in charge of merging everyone’s answers into the final document [homework_4.Rmd]{.monospace}. Because the editor needs to prepare the document for submission, it is reasonable if he or she contributes slightly less in terms of answering the questions (for example, if everyone else answers three questions, it would be okay if the editor answers two).
The editor should ensure that everyone has committed and pushed their answers to GitHub so they can copy and paste in each contribution.
The editor needs to make sure that the final submission reads like a coherent document and that the writing style and code style are uniform across all the answers. In other words, it should read like a single person answered all the questions, not a group of four people.
The editor will be in charge of of committing and pushing the final R Markdown file to GitHub, knitting to PDF, and uploading the PDF file on Blackboard.

Fill out the [CONTRIBUTIONS.md]{.monospace} file

After everything is written up and ready for submission, the last thing the group will need to do is fill out the [CONTRIBUTIONS.md]{.monospace} file. By default, the file looks like this:

# Contributions to group submission

## Editor: FirstName LastName Member 1

*   Questions answered:

## FirstName LastName Member 2

*   Questions answered:

## FirstName LastName Member 3

*   Questions answered:

## FirstName LastName Member 4

*   Questions answered:

At a minimum, you must remove the [FirstName LastName Member]{.monospace} entries in the template and fill in the names of the people in your group, indicate which group member served as the editor, and state which questions were written up by each member.

Additional information beyond this should be supplied, such as indicating when a group member helped another group member edit an answer or gave helpful comments in a Slack discussion. For example, one group member’s contribution list may read as follows:

## Jane Smith

*   Questions answered: 4, 5, 6
*   Helped with editing on answers 8 and 9
*   Collaborated with group member Jack Williams on answering question 10
*   Pointed out spelling errors and suggested fixes to the document layout in the merged group document

Working with a GitHub repository as a group

You will likely encounter some issues while working in a group-based GitHub repository. In particular, you might find that when you click “Push” in the Git tab of RStudio, that it doesn’t seem to work and instead you get an annoying error message! This will happen when another member of your group has uploaded work before you did. While this can be irritating to deal with, this is actually a good thing, as it is GitHub’s way of protecting your files from accidential overwrites and deletions.

So what should you do to keep things running smoothly? First, always work in your own file, never in another person’s file. If you are not the editor, then you should not edit [homework_4.Rmd]{.monospace} either! Also, do not remove or rename any files that are not your own. Finally, when you are getting ready to work, following the procedure below should help keep the error messages to a minimum:

When you start working, you should begin by going to the Git tab and clicking “Pull” (notice this is not the same as “Push”). This will synchronize any new changes that your group may have made into your files.
Work on your file as normal. When you have completed your work, save your file.
Now we want to commit. But first, go to the Git tab and click “Pull” one more time to check for any other changes. Then, still in the Git tab, click the checkmark next to your updated file, type a message in the messagebox, and click the Commit button.
If the updated file is no longer in the list of files in the Git tab, then your commit was successful.
Click “Push” to upload your changed file.

If the above advice doesn’t work…

If, even after following the advice below, you still encounter error messages when Pulling from and Pushing to GitHub, contact the course instructor for help.

How to submit

The editor should follow the steps below to submit the homework for his/her group.

Make sure that everyone has committed and pushed their R Markdown files so that everything is synchronized to GitHub. If you do this right, then you will be able to view all the completed files on the GitHub website.
Knit your group’s R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to Homework 4 posting on Blackboard.

Market value of condominiums in New York City dataset

This dataset reports on the market valuations of condominiums in New York City for Fiscal Year 2011/2012. The data¹ comes from the New York City Department of Finance and was made available to the general public on the NYC OpenData website (https://opendata.cityofnewyork.us/), which was subsequently cleaned and aggregated version by data scientist Jared Lander (https://www.jaredlander.com/data). The official description for the dataset is as follows:

Condominiums and cooperatives are valued as if they were residential rental apartments. Income information from similar rental properties is applied to determine value. The Department of Finance (DOF) chooses similar properties to value condos and coops. Properties are selected based on a combination of factors such as: land location, income levels, building age and construction and exemptions and subsidies.

The driving question for our analysis as follows:

What are key factors that affect the overall price of condominiums in New York City?

Thus, your main goal is to build a linear regression model that predicts the market value per squarefoot—-the variable [value_per_sqft]{.monospace}—-of condominiums in New York City. Note that you are not expected to find a regression model with 100% precision, instead our interest is in using models to uncover trends in the data.

When building and evaluating predictive models, it is standard protocol to split your dataset into a training dataset and a test dataset. This has already been done for you, with the training dataset loaded into the variable [housing_train]{.monospace} and the testing dataset loaded into [housing_test]{.monospace}. You will be using [housing_train]{.monospace} for most of the homework to build and cross-validate your models. Once you’ve selected your model, as a final step you will use it to predict the [value_per_sqft]{.monospace} column in the dataset stored in [housing_test]{.monospace}.

The dataset contains the 7 variables listed below.

Variable	Description
[boro]	Borough where building is located. New York City is divided into 5 boroughs, Manhattan, The Bronx, Brooklyn, Queens, and Staten Island.
[neighborhood]	Neighborhood of building location. The neighborhood name is assigned by the New York City Department of Finance, and in most cases is the same as the neighborhood’s common name.
[class]	Building classification code assigned by the New York City Department of Finance. There are four building classifications for the condominiums in the dataset, rental, walk-up, elevator, and co-op.
[year_build]	Year the building was built
[units]	Total number of units in the building
[sqft]	Gross square footage of the building
[value_per_sqft]	Total market value per squarefoot of the land and building

Cross-validation helper functions

For your convenience, the helper function rep_kfold_cv(data, k, model, cv_reps) is loaded into your R environment and will run the code that cross-validates your models.² This function requires four inputs:

Input	Description
[data]	The training dataset
[k]	the number of folds to use for cross-validation
[model]	Model to cross-validate written in the [lm()]
[cv_reps]	Number of times to repeat cross-validation sequence to improve statistical averaging

Questions

@. Create the following visualizations to explore the dataset:

*   A histogram of [value\_per\_sqft]{.monospace} faceted over boroughs of New York City

*   Box plots of [units]{.monospace} (y axis) for the different boroughs (x axis) plotted two different ways: in a normal scale and in a `log10()` scale along the y axis (see <http://r4ds.had.co.nz/graphics-for-communication.html#replacing-a-scale> for how to scale the axes)

*   Box plots of [sqft]{.monospace} (y axis) for the different boroughs (x axis) plotted two different ways: in a normal scale and in a `log10()` scale along the y axis

*   Scatterplots of [value\_per\_sqft]{.monospace} (y axis) versus [units]{.monospace} (x axis) using `log10()` scaling for [units]{.monospace}.
    Facet over two variables: boroughs **and** condominium classification.

Based on your plots so far, which variables (columns) in the dataset seem to have the strongest overall impact on the condominium values?

@. The box plots of [units]{.monospace} in the previous question should reveal extreme outliers in the plot. Since our goal is to model general trends and not precise values, fitting to these data points may skew our model in an unhelpful way. Filter the dataset to remove these outliers (there are 3 in all).

@.
Build 4 different univariate (single variable) linear regression models using [value_per_sqft]{.monospace} as your response variable and either [boro]{.monospace}, [class]{.monospace}, [units]{.monospace}, or [sqft]{.monospace} as your explanatory variable. Plug these models into the k-fold cross-validation function rep_kfold_cv() and use [k = 10]{.monospace} and [cv_reps = 3]{.monospace} as your other inputs. Compare the mean-square error (MSE), both unadjusted and adjusted, and $R^2$ for these different models, noting that models with better predictive power will have lower MSE and higher $R^2$ scores. Which model did the best so far?

@.
Build and cross-validate at least 3 multivariate models that predict [value_per_sqft]{.monospace}, using the k-fold cross-validation parameters [k = 10]{.monospace} and [cv_reps = 3]{.monospace}. An example of a multivariate model is [value_per_sqft ~ boro + units]{.monospace}. You may also want to consider including interaction terms (see http://r4ds.had.co.nz/model-basics.html#interactions-continuous-and-categorical for a quick review). For example, you might try [value_per_sqft ~ boro + class * sqft]{.monospace}. Which of your models performs the best? Is it significantly better than your best model in the last question?

@.
Now that you’ve selected your model, train it on the full dataset:

```r
final_model <- lm(model_formula, data = housing_train) 
```

where [model\_formula]{.monospace} is the best performing model from the previous question.
Use [final\_model]{.monospace} to calculate the mean-square error for predictions on the *testing* dataset:

```r
final_model %>%
  mse(housing_test)
```

This score is useful because it is absolute and allows you to compare how well your model performs against all other model types.
Can you do better than a MSE score of 2030.34?

@.
To wrap up, evaluate how well your model obeys the conditions for least squares linear regression, which are summarized on page 238 of the Introductory Statistics with Randomization and Simulation textbook. Make two plots to inspect how well your model conforms to the requirements for linear regression:

*   To evaluate the residual spread, make a residual (y axis) versus predicted (x axis) scatterplot.

*   To evaluate whether the residual distribution is nearly normal, make a residual Q-Q plot.

Explain whether your model obeys the conditions for least squares linear regression.

Cheatsheets

You are encouraged to review and keep the following cheatsheets handy while working on this assignment:

Data set aggregated from the following sources:
https://data.cityofnewyork.us/Finances/DOF-Condominium-Comparable-Rental-Income-Manhattan/dvzp-h4k9
https://data.cityofnewyork.us/Finances/DOF-Condominium-Comparable-Rental-Income-Brooklyn-/bss9-579f
https://data.cityofnewyork.us/Finances/DOF-Condominium-Comparable-Rental-Income-Queens-FY/jcih-dj9q
https://data.cityofnewyork.us/Property/DOF-Condominium-Comparable-Rental-Income-Bronx-FY-/3qfc-4tta
https://data.cityofnewyork.us/Finances/DOF-Condominium-Comparable-Rental-Income-Staten-Is/tkdy-59zg ↩
For those that are interested in seeing how you would implement k-fold cross-validation using the [tidyverse]{.monospace} packages, the code for the function rep_kfold_cv() can be found in the file [repeated_kfold_cross_validation.R]{.monospace} distributed with your Github repo. ↩