Introduction
Airbnb has had a remarkable rise in popularity over the past 11 years. Airbnb, the online marketplace for arranging and offering lodging, now lists more than 6 million rooms and houses in 81,000 cities [1]. The platform has more than 150 million users worldwide, with 2 million people staying in Airbnb rentals across the world on any given night [1].
Motivation and Background
Given the extremely large quantity of listings, finding a fair price for an Airbnb listing is a problem that every user has experienced, and one that is immensely important to how both Airbnb users and property owners operate. Because 53% of travelers use Airbnb for its cost savings, a large share of rentals happen only because of their price [2]. Only 11% of the nearly 500,000 US listings are reserved on a typical night, so there is clearly a lot of room for improvement [2]. Our project attempts to answer the question: “How much is an Airbnb listing worth?” The answer to this question will allow both users and property owners to gauge the market value for a listing, making it easier for business to be conducted in a multi-billion dollar market [4].
Our Approach
We plan to answer the question “How much is an Airbnb listing worth?” by developing a supervised learning model to predict the price of an Airbnb listing based upon its features and classify it in a price range.
To do this we will:
- Pre-process the dataset and determine which features are usable.
- Discover the most important features through feature selection.
- Build an Airbnb listing price prediction model using different types of regression.
- Classify each Airbnb listing into a price range.
The techniques we’ve used include:
- Feature Selection
- Univariate Feature Selection
- Recursive Feature Elimination (RFE)
- Tree-based Feature Selection
- Cross Validation
- Regression
- Linear Regression
- Lasso Regression
- Ridge Regression
- Binning
- Classification
- Decision Tree
- Support Vector Machine (SVM)
Data Description and Initial Exploration
We will be using a training dataset of 74,111 Airbnb listings across 6 major US cities (Boston, Chicago, DC, LA, NYC, SF). The dataset was taken from Kaggle’s “Deloitte Machine Learning Competition” [5].
There are a total of 29 features, including: `room_type`, `bathrooms`, `bedrooms`, `beds`, `review_scores_rating`, and `neighbourhood`.
The feature we want to predict is `log_price`.
Prices
The prices for the listings in the dataset are greatly skewed to the right as shown below. This causes a non-linear relationship between the price and features.
We take the natural logarithm of the price so that a linear model can capture the underlying non-linear relationship between price and the features. This will be very important when testing different regression models. The results of this are shown below.
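As a minimal sketch of the effect, using hypothetical synthetic prices rather than the real dataset, the log transform compresses the long right tail and reduces the skew:

```python
import numpy as np

# Hypothetical right-skewed nightly prices (illustrative only)
prices = np.array([50, 75, 100, 120, 150, 200, 400, 950, 2000], dtype=float)

# Natural log compresses the long right tail
log_price = np.log(prices)

def skew(x):
    """Sample skewness: third standardized moment."""
    return float(((x - x.mean()) ** 3).mean() / x.std() ** 3)

print(skew(prices))     # strongly positive for raw prices
print(skew(log_price))  # much smaller after the log transform
```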
Methodology
Data pre-processing
There are 28 features (excluding `log_price`) in the raw dataset.
Initial Feature Elimination
First, we manually eliminated some features we felt were difficult to use, difficult to enumerate, or unnecessary. The features removed were `id`, `amenities`, `description`, `first_review`, `host_response_rate`, `host_since`, `last_review`, `latitude`, `longitude`, `name`, and `thumbnail_url`. This left us with 16 features to predict `log_price` with.
Null Handling
Within the remaining 16 features, `bathrooms`, `host_has_profile_pic`, `host_identity_verified`, `beds`, `bedrooms`, `neighbourhood`, and `review_scores_rating` all have null values or unintentionally missing information. We fixed this by replacing the null values with defaults: 0 for `bathrooms`, `beds`, and `bedrooms`, and ‘F’/False for `host_has_profile_pic` and `host_identity_verified`.
`neighbourhood` was also a feature with missing information; however, there wasn’t a default value that could simply be assigned without throwing off the dataset. This required dropping all rows with missing values for `neighbourhood` (6,872 rows).
`review_scores_rating` was the only feature where a missing value still provided information: it meant that there were no reviews for the Airbnb listing. We elected to handle this by also dropping all rows with missing values for `review_scores_rating` (16,772 rows). We dropped these rows to preserve the linearity of the relationship between `review_scores_rating` and `log_price`; assigning them all a default value of 0 would have distorted that relationship.
In the end, the 74,111 rows were cut down to 52,343 rows after null handling.
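The null handling above can be sketched with pandas on a tiny hypothetical frame (the column names match the dataset, but the values are made up):

```python
import pandas as pd

# Toy frame mimicking the null patterns described above
df = pd.DataFrame({
    "bathrooms": [1.0, None, 2.0],
    "host_has_profile_pic": ["t", None, "t"],
    "neighbourhood": ["Mission", None, "Loop"],
    "review_scores_rating": [95.0, 80.0, None],
    "log_price": [4.6, 5.0, 5.3],
})

# Numeric features: null -> 0; boolean features: null -> 'f'
df["bathrooms"] = df["bathrooms"].fillna(0)
df["host_has_profile_pic"] = df["host_has_profile_pic"].fillna("f")

# neighbourhood and review_scores_rating have no sensible default,
# so rows missing either one are dropped entirely
df = df.dropna(subset=["neighbourhood", "review_scores_rating"])
print(len(df))  # only the fully populated row survives
```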
Feature Enumeration
For our different supervised learning techniques to work, all of our features needed to be represented numerically (or enumerated).
Features like `bathrooms`, `beds`, `bedrooms`, `number_of_reviews`, `accommodates`, and `review_scores_rating` are already numeric, so they require no modification.
The boolean features `cleaning_fee`, `host_has_profile_pic`, `host_identity_verified`, and `instant_bookable` can simply be transformed from True/’T’ and False/’F’ to 1 and 0, respectively.
The other features are categorical: `property_type` (35 types), `room_type` (3 types), `bed_type` (5 types), `cancellation_policy` (6 types), `city` (6 values), and `neighbourhood` (619 values). These can be represented numerically in different ways.
The way we handled enumerating these features was by assigning a number for each unique type of the feature, which is also known as Label Encoding.
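A minimal sketch of label encoding with scikit-learn, using hypothetical `room_type` values:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical room_type values; the real column has three types
room_type = ["Entire home/apt", "Private room", "Shared room", "Private room"]

le = LabelEncoder()
encoded = le.fit_transform(room_type)  # each unique category -> an integer

print(list(encoded))      # e.g. [0, 1, 2, 1]
print(list(le.classes_))  # categories in sorted (alphabetical) order
```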
Feature Selection
Since we have many features, it made sense to take a closer look at the data to see which ones were most important. To do this, we used three different feature selection methods: Univariate Feature Selection, Recursive Feature Elimination, and Tree-based Feature Selection. Each method scored each feature and we took those scores and averaged them together. The features with the highest scores were the features that were most important to our dataset. Let’s take a closer look at each method.
Univariate Feature Selection
Univariate Feature Selection helps build a better understanding of the data and allowed us to select the top features to improve our model. It works by scoring each feature based upon univariate statistical tests. We used `f_regression` as our scoring function, a univariate linear regression test that estimates the degree of linear dependency between two random variables. Below are the results from performing this test on our dataset.
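A small illustration of `f_regression`-based selection on synthetic data (not the Airbnb dataset): only the first feature drives the target, and it receives by far the highest score.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # only feature 0 drives y

# Score each feature independently with the univariate F-test
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)

print(selector.scores_)                    # F-score per feature
print(selector.get_support(indices=True))  # index of the top feature
```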
Recursive Feature Elimination
Recursive feature elimination takes in an external estimator (in this case we used a linear estimator) and selects the best features for that estimator. It does this by recursively evaluating the importance of each feature based upon the estimator and then removing the least important feature. It stops removing features once it has reached the desired number of remaining features. Using the rankings of each feature, we were able to see how important each feature was to the dataset, which can be seen below.
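A sketch of RFE with a linear estimator on synthetic data (the real feature matrix is not reproduced here); the two informative features end up with rank 1.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + 2 * X[:, 2] + 0.1 * rng.normal(size=200)

# Recursively drop the least important feature until 2 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)

print(rfe.ranking_)  # selected features are ranked 1, others higher
```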
Tree-based Feature Selection
For tree-based feature selection, we used the extra-trees classifier, which fits a number of randomized decision trees on various sub-samples of the dataset. To perform this type of feature selection, we used our binned `log_price` values (more on this below) so that we could perform classification. Using the feature importances given by the estimator, we were able to tell which features were most important, as shown below.
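A minimal sketch with synthetic binned labels (standing in for the binned `log_price` classes): the one feature that determines the class dominates the importances.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = (X[:, 1] > 0).astype(int)  # class label driven only by feature 1

# Fit randomized trees and read off the averaged impurity-based importances
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

print(clf.feature_importances_)  # feature 1 should dominate
```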
We took the scores/importances from each of our methods, averaged them, and then found which of those averages were the highest. The higher the average, the more important the feature. See the image below that shows the average score/importance of each feature.
This left us with the following features: `accommodates`, `beds`, `bathrooms`, `review_scores_rating`, `cancellation_policy`, `bedrooms`, `room_type`, `instant_bookable`, `city`, and `bed_type`.
Cross Validation
We used 10-Fold Cross Validation to test the effectiveness and overfitting of our models.
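A sketch of 10-fold cross validation with scikit-learn on synthetic regression data (scikit-learn reports negated MSE, since its scorers are maximized):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=200)

# 10-fold CV: one score per fold; negate to get MSE
scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring="neg_mean_squared_error")
mse_per_fold = -scores

print(mse_per_fold.mean())  # average MSE across the 10 folds
```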
Regression Models
Linear Regression
We used a Linear Regression model to fit our data and got an average MSE of 0.18577 for all features and 0.18768 for the top 10 features.
The linear model showed consistent MSE values across each fold of cross validation, suggesting minimal overfitting.
Lasso Regression
When using a Lasso Regression model, we found that the MSE decreased as the regularization constant approached 0 (essentially giving the linear regression least squares solution and MSE above).
Ridge Regression
Similarly for Ridge Regression, the lowest MSE value was observed when the regularization constant approached 0. However, it converged faster than Lasso Regression.
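The alpha sweep described above can be sketched on synthetic data (the MSE values will not match the report's numbers): as the regularization constant shrinks, both Lasso and Ridge approach the ordinary least-squares fit.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.5, -1.0, 0.8, 0.3, -0.5]) + 0.2 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lasso_mse, ridge_mse = [], []
for alpha in [1.0, 0.1, 0.01, 0.001]:  # decreasing regularization
    lasso_mse.append(mean_squared_error(
        y_te, Lasso(alpha=alpha).fit(X_tr, y_tr).predict(X_te)))
    ridge_mse.append(mean_squared_error(
        y_te, Ridge(alpha=alpha).fit(X_tr, y_tr).predict(X_te)))

# As alpha -> 0, both converge to the ordinary least-squares solution
ols_mse = mean_squared_error(
    y_te, LinearRegression().fit(X_tr, y_tr).predict(X_te))
print(lasso_mse, ridge_mse, ols_mse)
```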
Classification
Binning Methods
Oftentimes, finding an exact price isn’t critical when surveying an Airbnb listing; a price range can be just as telling. To capture this, we discretized the log price into four categories: `low`, `medium`, `high`, and `very high`.
We used both equal frequency and equal width binning methods. Equal frequency makes it so that each bin has roughly the same number of items. Equal width splits the bins such that the width of the intervals are the same.
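Both binning methods have direct pandas equivalents, sketched here on synthetic log prices (`qcut` for equal frequency, `cut` for equal width):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
log_price = pd.Series(rng.normal(loc=4.8, scale=0.7, size=1000))

labels = ["low", "medium", "high", "very high"]
eq_freq = pd.qcut(log_price, q=4, labels=labels)      # ~equal counts per bin
eq_width = pd.cut(log_price, bins=4, labels=labels)   # equal interval widths

print(eq_freq.value_counts().to_dict())   # near-identical bin sizes
print(eq_width.value_counts().to_dict())  # sizes follow the density
```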
Decision Trees
Classification Process
Now that we have discrete labels, we can fit our data with a Decision Tree Classifier. We tested depths of [2, 6, 10, 14, 18, 22]; depths of 10 and 6 provided the highest accuracies for the equal frequency and equal width methods, respectively.
The equal frequency method resulted in accuracies of 0.56556 for all features and 0.55163 for the top 10 features. The equal width method resulted in 0.88396 and 0.88340.
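The depth sweep can be sketched as follows on synthetic binned labels (the best depth and accuracies will differ from the report's):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 4))
# Four synthetic "price bins" driven by two features
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-1.0, 0.0, 1.0])

# Cross-validated accuracy for each candidate depth
acc = {}
for depth in [2, 6, 10, 14, 18, 22]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    acc[depth] = cross_val_score(clf, X, y, cv=10).mean()

best = max(acc, key=acc.get)
print(best, acc[best])
```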
Support Vector Machine (SVM)
Classification Process
We also used a Support Vector Machine to fit our data, using the one-versus-one strategy since we have a multi-class problem. The equal frequency method using the RBF kernel resulted in accuracies of 0.39740 for all features and 0.51300 for the top 10 features. The equal width method, also using the RBF kernel, resulted in 0.86250 and 0.87072. These results are compared below.
We also tested our SVM with different kernels while still using the one-versus-one multi-class strategy. Above we showed our results for all of our features and our top 10 features using the RBF kernel. After testing the RBF kernel, we wanted to see what kind of results we could produce with other kernels, namely the polynomial kernel and the sigmoid kernel. Below are the results from modeling our SVM with those kernels, fitted with all of our features and our top ten features.
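The kernel comparison can be sketched on synthetic multi-class data (scikit-learn's `SVC` already uses one-versus-one internally for multi-class problems; accuracies here are not the report's):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = np.digitize(X[:, 0], bins=[-1.0, 0.0, 1.0])  # four synthetic classes

scores = {}
for kernel in ["rbf", "poly", "sigmoid"]:
    # decision_function_shape="ovo" exposes the one-versus-one scores
    clf = SVC(kernel=kernel, decision_function_shape="ovo")
    scores[kernel] = cross_val_score(clf, X, y, cv=5).mean()

print(scores)
```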
Final Results
Linear MSE
Lasso and Ridge MSE
Expected vs. Actual (All Features)
Expected vs. Actual (Top 10 Features)
Decision Tree Accuracy
SVM Accuracy
Conclusion
The linear model performed best out of all the regression methods. Ridge and Lasso reduce overfitting by adding penalties to large coefficients in the model. The predicted vs. actual graphs are not perfectly linear but follow a general trend, which suggests the linear model does not overfit; accordingly, the MSE values for both Ridge and Lasso decreased as the regularization constant approached 0, effectively reducing them to the linear model. It is important to note that Ridge and Lasso regression might perform better on data not represented in the training and test datasets. However, Airbnb listing prices usually fall in the same range, especially after taking the log, so this is unlikely.
Both supervised learning models performed better when we discretized the labels with equal width bins. This is expected, as equal width binning groups related points together better than equal frequency binning when the data has very dense sections. Since equal frequency binning tries to make each group roughly the same size, it can label a point differently from points very close to it because of the size constraint. Taking the log brings the prices closer together, so nearby points frequently ended up in different bins, and the predicted label was oftentimes different from the actual one because of this inconsistent binning.
The Decision Tree performed marginally better than the Support Vector Machine. It did not significantly outperform the SVM, but it may have had the edge because the dataset has discrete features: support vector machines are generally used on numeric data, and enumerating discrete features may have hindered the SVM’s performance slightly.
In all methods, there was no significant difference in MSE/accuracy between using all features and using the top 10 selected features. This is because the small number of features (16) will not cause overfitting, so selecting the top 10 did not make the models perform better.
So now to our question: “How much is an Airbnb listing worth?” Our models did a reasonably good job at predicting/classifying an Airbnb listing price given a set of features. So if you are planning a trip or thinking about renting out your place, you now know what a good bargaining price is 😎.
Looking Back
Looking at our dataset and our results from our supervised learning algorithms, we have been able to identify a few things that we could have done differently that may have given us better results. This mainly relates to our data pre-processing methods and the way we handled our various features.
- `amenities`: this feature could have been converted into an array of all the individual amenities with one-hot (True/False) representations.
- `description`, `name`: these features could have used NLTK (Natural Language Toolkit) sentiment analysis to determine how positive each was.
- `first_review`, `host_since`, `last_review`: these date features could have been converted into “time since a date” features.
- `latitude`, `longitude`: could have been converted into additional features (e.g. distance from the city center, walkability, transit score) or used with additional APIs (e.g. Google Maps).
- `thumbnail_url`: could have been converted into having or not having a thumbnail (1 or 0).
Target/Mean Encoding and One-Hot Encoding could have been used to obtain greater accuracy with a lower MSE, with the tradeoff of possibly overfitting or running into the dummy variable trap.
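As a sketch of the one-hot alternative (hypothetical values, not the real column), pandas can expand a categorical feature into indicator columns; dropping the first category avoids the dummy variable trap mentioned above:

```python
import pandas as pd

# Hypothetical room_type values standing in for any categorical feature
df = pd.DataFrame({"room_type": ["Entire home/apt", "Private room", "Shared room"]})

# drop_first=True removes one redundant indicator column
dummies = pd.get_dummies(df["room_type"], prefix="room_type", drop_first=True)
print(list(dummies.columns))  # 3 categories -> 2 indicator columns
```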
Contributions
- Prices: Henry Harris
- Pre-Processing and Feature Engineering: Austin Cho
- Feature Selection
- Univariate Feature Selection: Chasse Rush
- Recursive Feature Elimination (RFE): Chasse Rush
- Tree-based Feature Selection: Chasse Rush
- Regression
- Linear Regression: Austin Cho/Damian Henry,
- Lasso Regression: Austin Cho/Damian Henry,
- Ridge Regression: Austin Cho/Damian Henry,
- Binning: Damian Henry/Henry Harris
- Classification
- Decision Trees: Damian Henry/Henry Harris
- Support Vector Machine (SVM): Chasse Rush
- GitHub Pages Setup: Chasse Rush
- GitHub Pages Editing: Austin Cho/Damian Henry/Chasse Rush/Henry Harris
References
[1] J. Bustamante, “Airbnb Statistics,” iPropertyManagement, November 2019. [Online]. Available: https://ipropertymanagement.com/airbnb-statistics.
[2] “Airbnb by the Numbers: Usage, Demographics, and Revenue Growth,” MuchNeeded. [Online]. Available: https://muchneeded.com/airbnb-statistics. [Accessed November 17, 2019].
[3] L. Hohnholz, “Behind the scenes of Airbnb: The stats & facts 2018,” eTurboNews, October 12, 2018. [Online]. Available: https://www.eturbonews.com/235339/behind-the-scenes-of-airbnb-the-stats-facts-2018. [Accessed November 17, 2019].
[4] A. Wilhelm, “Quick Notes On Airbnb’s Revenue Growth, Huge Cash Reserves,” Crunchbase News, August 19, 2019. [Online]. Available: https://news.crunchbase.com/news/quick-notes-on-airbnbs-revenue-growth-huge-cash-reserves. [Accessed November 17, 2019].
[5] R. Mizrahi, “AirBnB Listings in Major US Cities,” Kaggle, March 14, 2018. [Online]. Available: https://www.kaggle.com/rudymizrahi/airbnb-listings-in-major-us-cities-deloitte-ml.