Content and Source of Data

This dataset contains house sales records for Brooklyn, New York City, from 2003 to 2017. It is a combination of two sources, the NYC Rolling Sales data and the PLUTO files, which have been merged and presented here as a single .csv file. The data comprise 353,546 rows and 111 features, and we are interested in predicting sale_price using the other covariates.

A detailed glossary can be found in the competition materials folder.

Objectives

The aim of this study is to predict the sale prices of the Brooklyn properties in the test set, using the data provided and/or other data available online (excluding data from the original source of this dataset).

Your predictions will be judged on accuracy under the quadratic (squared-error) loss function. The interpretability of your model and the clarity of the explanation in your written report are also important.
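The quadratic loss is simply the mean of the squared differences between the true and predicted prices. A minimal sketch, using small hypothetical price vectors for illustration:

```r
# Quadratic (squared-error) loss: the average of (y - yhat)^2.
# These vectors are hypothetical, purely for illustration.
y.true <- c(550000, 720000, 430000)
y.pred <- c(500000, 700000, 450000)
mse <- mean((y.true - y.pred)^2)
mse
## [1] 1.1e+09
```

Lower values are better; note the units are squared dollars, so the root mean squared error `sqrt(mse)` is often easier to interpret.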

You should write your predictions to a .csv file, zip it together with all the code used, and upload the archive to the designated folder.
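For example, assuming a data frame holding an identifier column and the predicted prices (both column names here are hypothetical, not prescribed by the brief), the submission file could be written as:

```r
# Hypothetical submission data frame: an id column plus predicted prices.
pred <- data.frame(id = 1:3, sale_price = c(550000, 720000, 430000))
# row.names = FALSE avoids writing a spurious first column of row labels.
write.csv(pred, 'predictions.csv', row.names = FALSE)
```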

Overview

Before diving into the analysis, let us look through the dataset to see which features are included and if any cleaning is necessary.

Below we print the names of the features whose proportion of missing values (NAs) exceeds 80%. One can immediately see that a large percentage of values is missing in some features, e.g. easement and HistDist. These columns are very unlikely to be useful in our modelling process, so it is recommended to exclude them from the subsequent analysis. Columns with a small proportion of missing values, on the other hand, should be treated more carefully; note also that the dataset contains both numerical and categorical entries. Furthermore, features such as Version, the version number of the PLUTO release, would obviously have little impact on any predictions made. Finally, since this is a combination of datasets from two different sources, some features are duplicated.

It is suggested to apply dimensionality reduction methods, e.g. principal component analysis, when carrying out further feature selection. Make sure you also look over the glossaries and understand the meaning of the features involved (especially the response variable, sale_price).
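A minimal PCA sketch on complete-case numeric columns is shown below. Since PCA requires a complete numeric matrix, a small synthetic frame stands in for the numeric Brooklyn features here (the column names lot_area, gross_sqft, and year_built are illustrative assumptions, not guaranteed to match the glossary); on the real data one would first subset with `sapply(b.dat, is.numeric)` and handle NAs:

```r
# Synthetic stand-in for the complete-case numeric columns of the data.
set.seed(1)
b.num <- data.frame(lot_area   = rnorm(100, 2500, 400),
                    gross_sqft = rnorm(100, 1800, 300),
                    year_built = rnorm(100, 1940, 20))
# Centre and scale before PCA so no single feature dominates the variance.
b.pca <- prcomp(b.num, center = TRUE, scale. = TRUE)
summary(b.pca)  # proportion of variance explained by each component
```

Components explaining a negligible share of the variance are candidates for removal; note that `scale. = TRUE` fails on zero-variance columns, which should be dropped first.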

# stringsAsFactors = TRUE is needed on R >= 4.0 so that character
# columns are read as factors, as assumed by the level counts below.
b.dat <- read.csv('./brooklyntrainset.csv', stringsAsFactors = TRUE)
b.colNames <- colnames(b.dat)
b.naSum <- as.numeric(sapply(b.dat, function(y) sum(is.na(y))) / nrow(b.dat))
names(b.naSum) <- b.colNames
print(b.naSum[b.naSum > 0.8])
##   easement  ZoneDist2  ZoneDist3  ZoneDist4   Overlay1   Overlay2 
##  1.0000000  0.9721734  0.9997850  0.9999972  0.9164974  0.9999745 
##    SPDist1    SPDist2    SPDist3  LtdHeight  OwnerType        Ext 
##  0.9179682  0.9999661  1.0000000  0.9873934  0.8825217  0.8469195 
##   HistDist   Landmark     ZMCode  EDesigNum    APPDate FIRM07_FLA 
##  0.9523739  0.9997907  0.9876537  0.9934945  0.9579687  0.9823531 
## PFIRM15_FL 
##  0.9364411
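Acting on this, one can drop every column whose NA proportion exceeds the 80% threshold (the threshold itself is a judgment call). On the real data this is simply `b.dat[, b.naSum <= 0.8]`; the toy frame below makes the pattern self-contained:

```r
# Toy illustration: drop columns whose proportion of NAs exceeds 80%.
toy <- data.frame(mostly_na = c(NA, NA, NA, NA, NA),
                  usable    = 1:5)
na.prop  <- sapply(toy, function(y) mean(is.na(y)))
toy.kept <- toy[, na.prop <= 0.8, drop = FALSE]
names(toy.kept)
## [1] "usable"
```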

One might also be interested in which features are numerical and which are categorical. The code below returns the categorical variables together with their numbers of levels (categories). Note that five of the variables have only one level, meaning they can make no contribution to the sales prediction. We therefore remove these predictors from the data.

cat.numb <- sapply(b.dat[,sapply(b.dat, is.factor)], nlevels)
print(cat.numb)
##            neighborhood building_class_category               tax_class 
##                      64                      88                      11 
##          building_class                 address        apartment_number 
##                     183                  209055                    5262 
##  building_class_at_sale               sale_date                 Borough 
##                     184                    5371                       1 
##                FireComp                SanitSub                 Address 
##                     102                      33                  142809 
##               ZoneDist1               ZoneDist2               ZoneDist3 
##                      78                      54                      10 
##               ZoneDist4                Overlay1                Overlay2 
##                       1                       9                       5 
##                 SPDist1                 SPDist2               LtdHeight 
##                      18                       2                       1 
##               SplitZone               BldgClass               OwnerType 
##                       2                     163                       5 
##               OwnerName                     Ext              IrrLotCode 
##                  131480                       3                       2 
##                HistDist                Landmark                 ZoneMap 
##                      35                      46                      32 
##                  ZMCode                 Sanborn               EDesigNum 
##                       1                    1525                      96 
##                 APPDate                 Version 
##                    1990                       1
# Identify and remove features with only one level.
rm.colNames <- names(cat.numb[cat.numb == 1])
b.dat1 <- b.dat[, !(names(b.dat) %in% rm.colNames)]
cat('The categorical features with only one level are: ', rm.colNames, sep = '\n')
## The categorical features with only one level are: 
## Borough
## ZoneDist4
## LtdHeight
## ZMCode
## Version

Our next step could be checking for normality and outliers. Here, as a simple example, we fit the full ordinary linear regression model, i.e. a model with all predictors, taking sale_price as the response variable. We also exclude the features with more than 80% missing values to refine the model. Below are some diagnostic plots obtained from this model.
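The fit and plots follow the standard `lm` workflow. A small synthetic frame stands in for the cleaned data here so the sketch is self-contained (on the real data, very-high-cardinality factors such as address would also need to be dropped before `sale_price ~ .` is feasible):

```r
# Synthetic stand-in for the cleaned training data.
set.seed(2)
d <- data.frame(sale_price   = rlnorm(200, 13, 0.5),
                gross_sqft   = rnorm(200, 1800, 300),
                neighborhood = factor(sample(c('A', 'B', 'C'), 200,
                                             replace = TRUE)))
full.lm <- lm(sale_price ~ ., data = d)
par(mfrow = c(2, 2))
plot(full.lm)  # residuals vs fitted, normal Q-Q, scale-location, leverage
```

Right-skewed prices typically show up as curvature in the Q-Q plot, which is one motivation for modelling log(sale_price) instead.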