My family has lived in San Francisco, California for the past 20 years, and we have witnessed how much housing prices have increased over that time. According to this article, the median home value rose 90%, from about $720,000 to $1.36 million, between 2009 and 2019. Today, the majority of properties in San Francisco cost over a million dollars. For anyone who wants to own a home in this city, a few questions naturally arise: when is the best time to buy, given that prices always seem to go up? Which property types and which neighborhoods are more affordable? And how much should buyers offer for their target properties?
When Metis Data Science Bootcamp assigned a linear regression model as our first solo project, I saw an opportunity to use the data science skills I had gained to answer questions I care about. To start any data science project, besides the questions we are trying to answer, we need a data set.
Creating Your Data Set by Web Scraping
There are many websites that provide data sets for data science projects, such as Kaggle and the UCI Machine Learning Repository. However, being able to generate your own data set is a valuable skill: not only can you customize your project by deciding what information to collect, but you also have more control over data quality. For me, learning to web-scrape with Selenium and BeautifulSoup was the most fun part of this project. From a property listing website, I used Selenium to click through the links and collect the following information for each property: (1) zip code, (2) size, (3) number of beds, (4) number of baths, (5) year built, (6) lot size, (7) property type, (8) HOA fee, (9) date sold, (10) neighborhood, and (11) price sold. If you are interested in the details, the link to my GitHub repo for this project can be found at the end of this article.
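As a rough sketch of how that scraping step can look (the CSS class names below are hypothetical; every listing site has its own markup), Selenium drives the browser and hands each page's HTML to BeautifulSoup for parsing:

```python
from bs4 import BeautifulSoup

# In the real pipeline, Selenium navigates the listing pages and exposes
# the rendered HTML, roughly like this:
#
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(listing_url)
#   html = driver.page_source

def parse_listing(html):
    """Pull label/value pairs (zip code, beds, size, ...) out of one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for row in soup.select("div.fact"):                     # hypothetical selector
        label = row.select_one("span.label").get_text(strip=True)
        value = row.select_one("span.value").get_text(strip=True)
        record[label] = value
    return record

# Tiny inline sample standing in for a fetched page.
sample = """
<div class="fact"><span class="label">zip code</span><span class="value">94110</span></div>
<div class="fact"><span class="label">beds</span><span class="value">3</span></div>
"""
print(parse_listing(sample))  # -> {'zip code': '94110', 'beds': '3'}
```

Looping `parse_listing` over every property link, then loading the list of dicts into a pandas DataFrame, yields the raw data set.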
Findings from Exploratory Data Analysis (EDA)
With the data set ready to go, I first cleaned the data by converting features to the appropriate data types, removing duplicates, and looking into missing data. The missing values were either missing completely at random or missing at random, and since this project was done before I learned how easy multivariate imputation is, I simply dropped the observations with missing values. Then I performed EDA. I found that price sold (the target) and property size both had right-skewed distributions, so I log-transformed both to obtain approximately normal distributions.
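The log transform step can be illustrated on toy data; the log-normal prices below are a synthetic stand-in for the scraped price column, not the project's actual data:

```python
import numpy as np
import pandas as pd

# Simulate a right-skewed price column (log-normal, like real home prices).
rng = np.random.default_rng(0)
df = pd.DataFrame({"price_sold": np.exp(rng.normal(14.0, 0.5, 1000))})

# Taking the log pulls in the long right tail.
df["log_price"] = np.log(df["price_sold"])

# Skewness drops sharply after the transform (near 0 means roughly symmetric).
print(df["price_sold"].skew(), df["log_price"].skew())
```

The same transform was applied to property size, so both the target and its strongest predictor entered the model in log scale.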
From EDA, I found that the median property price seemed to have stabilized after March 2020, when the COVID pandemic hit, although it appeared to start climbing again in January 2021. From the bottom graph, I could see that as jobs went remote, more homes went on the market for sale. It made sense that as supply increased, prices stalled.
Among the four property types in the data set, condo, single-family house, multi-family house, and townhouse, condos were the most affordable option, with an average price of about $1.25 million. The most affordable area in the city was Bayview, on the southeast side, with a median property price of $949,000.
Linear Regression to Predict Property Price in Log Scale
After EDA, it was time to build my baseline linear regression model with all the continuous features, number of beds, number of baths, size in log scale, HOA fee, and year built, to predict property price in log scale. For the baseline model, I performed a simple train-test split. The model did a decent job of explaining the variance of the target, with an R-squared of 0.738 on the test set. I then converted the categorical features, property type and zip code, into dummy variables before adding them to the model, which raised the R-squared from 0.738 to 0.838. Finally, I tried adding interaction terms between all features (except zip code, which had too many categories) to see if I could improve performance further. However, the interaction terms only bumped up the R-squared by 0.005, so I decided to omit them.

I then fit a lasso regression with five-fold cross-validation to trim down the model and make sure the R-squared was properly validated on different sample splits. The final R-squared was 0.834, and the mean absolute error (MAE), the average absolute difference between the actual property price in log scale and the prediction, was 0.146. The most predictive feature was the size of the property in log scale, followed by the single-family house property type; both were positively associated with price.
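The modeling steps above can be sketched with scikit-learn; the feature names mirror the project's, but the values below are generated for illustration, not the scraped data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the cleaned data set.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "beds": rng.integers(1, 5, n),
    "baths": rng.integers(1, 4, n),
    "log_size": rng.normal(7.3, 0.4, n),
    "property_type": rng.choice(
        ["condo", "single_family", "multi_family", "townhouse"], n),
})
# Simulated target: log price driven mostly by log size, as in the project.
df["log_price"] = (11 + 0.9 * df["log_size"]
                   + 0.2 * (df["property_type"] == "single_family")
                   + rng.normal(0, 0.15, n))

# Dummy-encode the categorical feature, then split train/test.
X = pd.get_dummies(df.drop(columns="log_price"), drop_first=True)
y = df["log_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Lasso with five-fold cross-validation picks the regularization strength.
scaler = StandardScaler().fit(X_train)
model = LassoCV(cv=5).fit(scaler.transform(X_train), y_train)
r2 = model.score(scaler.transform(X_test), y_test)
print(f"test R^2: {r2:.3f}")
```

Standardizing before the lasso matters: the penalty shrinks all coefficients equally, so features need to be on comparable scales for the shrinkage to be fair.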
The MAE seems like a small number, so I should be very happy with my model, right? Not so fast: since the target was in log scale, I couldn't take it at face value. What does an MAE of 0.146 mean, then? For a property with a price tag of $1.5 million (1,500,000 is approximately exp(14.22)), the prediction from my model would typically fall between $1.3 million (approximately exp(14.22 - 0.146)) and $1.74 million (approximately exp(14.22 + 0.146)). In other words, the difference between the actual and predicted price would be about $200,000 for a $1.5 million property.
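The arithmetic behind that conversion is just exponentiating the log-scale error band around the log price:

```python
import math

log_mae = 0.146                       # MAE of the model, in log scale
log_price = math.log(1_500_000)       # about 14.22

# Exponentiate the band [log_price - MAE, log_price + MAE] back to dollars.
low = math.exp(log_price - log_mae)   # ~ $1.30M
high = math.exp(log_price + log_mae)  # ~ $1.74M
print(f"${low:,.0f} to ${high:,.0f}")
```

Note the band is asymmetric in dollars (about -$204k / +$236k) because the error is additive in log scale, i.e. multiplicative in the original scale.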
To check model performance across different price ranges, I plotted the residuals, the difference between the actual price in log scale and the predicted price, against the predicted price, as shown in the graph below.
The residuals were randomly scattered, indicating the model worked well across the range of predicted prices in log scale.
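A residual plot like this takes only a few lines with matplotlib; the predictions and residuals below are simulated at the model's 0.146 error scale rather than taken from the project's actual output:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # headless backend, render to file
import matplotlib.pyplot as plt

# Simulated predictions and residuals at the model's log-scale error level.
rng = np.random.default_rng(1)
y_pred = rng.normal(14.0, 0.5, 300)
y_true = y_pred + rng.normal(0, 0.146, 300)
residuals = y_true - y_pred

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, s=8, alpha=0.5)
ax.axhline(0, color="red")
ax.set_xlabel("predicted log price")
ax.set_ylabel("residual")
fig.savefig("residuals.png")

# A healthy model leaves residuals centered on zero with no fanning pattern.
print(round(residuals.mean(), 3))
```

If the residuals fanned out or curved with the predicted price, that would signal heteroscedasticity or a missing nonlinear term; a random scatter around zero is the pattern we want.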
For people thinking about purchasing property in San Francisco, now is a good time to start hunting, as prices seem to have stabilized over the past year. Another important factor is that interest rates have reached historically low levels. Among the four property types in this data set, condos are the most affordable option, and the most affordable area in the city is Bayview. Although the model has an R-squared of 0.834, the typical difference between price sold and price predicted is about $200,000 for a $1.5 million property. The model could be improved in the future with more data and more property features.