CMSC320 Final Project: Airbnb Data Analysis

Hugo Burbelo

UID: 116014892

4/18/2020

Summary

This project analyzes a sample of Airbnb listings in Boston from 2019. I study how listing prices fluctuate and test whether there is a relationship between the price of a listing and the reviews it receives. The project uses Python as the core language, Pandas for data wrangling, Seaborn and plotnine for data visualization, VADER for sentiment analysis, and linear regression (via statsmodels) for the modeling.

Introduction

Airbnb has transformed over the past few years from a newcomer in the hospitality industry to the leader. As a home sharing platform that enables homeowners to list their homes online and earn money by hosting travelers, it has come to dominate the market. The price of an Airbnb can reach the range of a comparable hotel suite, so pricing is a critical component of the listing process. Reviews are also vital: hosts depend on positive reviews for the work they do and the services they provide.

In this analysis, I have selected Boston as the target city because many Boston homeowners list on Airbnb and because Boston is a large metropolitan area. The goal of the analysis is to study how prices fluctuate among Airbnb listings in Boston, and to determine whether there is a relationship between the price of a listing and the reviews it receives.

Airbnb listings offer a useful view of the housing environment of a particular city, and the tech-based platform lends itself to being a rich source of datasets. As such, I thought it would be an interesting area to explore further.

Installation of required libraries

In order to complete this tutorial yourself, you need to first install a few libraries with pip:

$ pip install pandas

$ pip install seaborn

$ pip install plotnine

$ pip install vaderSentiment

$ pip install statsmodels

Data collection

Where is the core Airbnb data for this tutorial located?

Dataset link: http://data.insideairbnb.com/united-states/ma/boston/2019-12-04/data/calendar.csv.gz

Why choose this specific Airbnb dataset?

This dataset is published by Inside Airbnb, an independent project that compiles publicly available Airbnb data, and provides detailed calendar data for Airbnb listings in the city of Boston. Keyed on the listing ID (a unique identifier for each "listing"), it covers thousands of Airbnb listings from the December 4, 2019 snapshot.

I chose this dataset because I've stayed in Airbnbs before and am interested in how listing prices and availability fluctuate over time. I am also planning a trip to Boston later this year, and I would like to use any insights from this analysis in my planning. The time-based data should also lend itself to interesting visualizations.

An entity in this dataset is a single day's information for a particular Airbnb listing in Boston. For example, if my house (the "listing") is available on Airbnb in Boston from June 1 through June 10, then there would be 10 entities in the dataset for that listing. The dataset contains 1,280,055 entities and 7 attributes.
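A quick way to check entity and attribute counts is DataFrame.shape. The sketch below builds a made-up listing that is available for ten days (the listing id, dates, and price are illustrative assumptions, not rows from the dataset) and confirms that it produces one entity per day.

```python
import pandas as pd

# Hypothetical listing (id 1) available June 1 through June 10, 2020.
days = pd.date_range("2020-06-01", "2020-06-10", freq="D")
toy_calendar = pd.DataFrame({
    "listing_id": 1,        # scalar values are repeated for every day
    "date": days,
    "available": "t",
    "price": "$100.00",
})

# shape is (number of rows, number of columns) = (entities, attributes)
n_entities, n_attributes = toy_calendar.shape
print(n_entities, n_attributes)
```

Running df.shape on the real calendar data after loading it should report the counts quoted above.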

In [13]:
# load the dataset
import pandas as pd

df = pd.read_csv('calendar.csv')
print(df.head(10))
   listing_id        date available   price adjusted_price  minimum_nights  \
0        5506  2019-12-04         f  $79.00         $79.00               3   
1        5506  2019-12-05         t  $79.00         $79.00               3   
2        5506  2019-12-06         t  $79.00         $79.00               3   
3        5506  2019-12-07         t  $79.00         $79.00               3   
4        5506  2019-12-08         t  $79.00         $79.00               3   
5        5506  2019-12-09         t  $79.00         $79.00               3   
6        5506  2019-12-10         t  $79.00         $79.00               3   
7        5506  2019-12-11         t  $79.00         $79.00               3   
8        5506  2019-12-12         t  $79.00         $79.00               3   
9        5506  2019-12-13         t  $79.00         $79.00               3   

   maximum_nights  
0             730  
1             730  
2             730  
3             730  
4             730  
5             730  
6             730  
7             730  
8             730  
9             730  
In [14]:
# load listings data
listings = pd.read_csv('listings.csv')
print(listings.head(10))
      id                         listing_url       scrape_id last_scraped  \
0   3781   https://www.airbnb.com/rooms/3781  20200414051806   2020-04-14   
1   5506   https://www.airbnb.com/rooms/5506  20200414051806   2020-04-14   
2   6695   https://www.airbnb.com/rooms/6695  20200414051806   2020-04-14   
3   8789   https://www.airbnb.com/rooms/8789  20200414051806   2020-04-14   
4  10730  https://www.airbnb.com/rooms/10730  20200414051806   2020-04-14   
5  10813  https://www.airbnb.com/rooms/10813  20200414051806   2020-04-14   
6  10986  https://www.airbnb.com/rooms/10986  20200414051806   2020-04-14   
7  16384  https://www.airbnb.com/rooms/16384  20200414051806   2020-04-14   
8  18711  https://www.airbnb.com/rooms/18711  20200414051806   2020-04-14   
9  22195  https://www.airbnb.com/rooms/22195  20200414051806   2020-04-14   

                                                name  \
0                          HARBORSIDE-Walk to subway   
1     **$49 Special **   Private! Minutes to center!   
2                     $99 Special!! Home Away! Condo   
3                Curved Glass Studio/1bd facing Park   
4                     Bright 1bed facing Golden Dome   
5  Back Bay Apt-blocks to subway, Newbury St, The...   
6    Waterfront/North End Furnished Large Studio Apt   
7                Small Room in Cambridge Kendall MIT   
8  The Dorset | 3BR 1BA | Walk to the Redline and...   
9                      Copley House - Premium Studio   

                                             summary  \
0  Fully separate apartment in a two apartment bu...   
1  Private guest room with private bath, You do n...   
2  Comfortable,  Fully Equipped private apartment...   
3  Bright, 1 bed with curved glass windows facing...   
4  Bright, spacious unit, new galley kitchen, new...   
5  Stunning Back Bay furnished studio apartment. ...   
6  Chic furnished studio apartment is located on ...   
7                                                NaN   
8  Hospital Grade Sanitation Performed!  - Minimu...   
9  Copley House is comprised of apartments spread...   

                                               space  \
0  This is a totally separate apartment located o...   
1  **THE BEST Value in BOSTON!!*** PRIVATE GUEST ...   
2  ** WELCOME *** FULL PRIVATE APARTMENT In a His...   
3  Fully Furnished studio with enclosed bedroom. ...   
4  Bright one bed facing the golden dome of the S...   
5  Back Bay Studio Apt -  Private bath, A/C, Cabl...   
6  Waterfront Studio apt on Commercial St.   Tota...   
7  No Frills Accommodations in Cambridge - Kendal...   
8  Kennedy Library - Umass Boston - Castle Beach ...   
9  Our Premium Studio is ideal for guests looking...   

                                         description experiences_offered  \
0  Fully separate apartment in a two apartment bu...                none   
1  Private guest room with private bath, You do n...                none   
2  Comfortable,  Fully Equipped private apartment...                none   
3  Bright, 1 bed with curved glass windows facing...                none   
4  Bright, spacious unit, new galley kitchen, new...                none   
5  Stunning Back Bay furnished studio apartment. ...                none   
6  Chic furnished studio apartment is located on ...                none   
7  No Frills Accommodations in Cambridge - Kendal...                none   
8  Hospital Grade Sanitation Performed!  - Minimu...                none   
9  Copley House is comprised of apartments spread...                none   

                               neighborhood_overview  ... instant_bookable  \
0  Mostly quiet ( no loud music, no crowed sidewa...  ...                f   
1  Peacful, Architecturally interesting, historic...  ...                t   
2  Peaceful, Architecturally interesting, histori...  ...                t   
3  Beacon Hill is a historic neighborhood filled ...  ...                f   
4  Beacon Hill is located downtown and is conveni...  ...                f   
5  Wander around this quintessential neighborhood...  ...                f   
6                                                NaN  ...                f   
7                                                NaN  ...                f   
8  Once its own city, Dorchester is now Boston’s ...  ...                f   
9  Back Bay is one of Boston's finest neighborhoo...  ...                t   

  is_business_travel_ready          cancellation_policy  \
0                        f              super_strict_30   
1                        f  strict_14_with_grace_period   
2                        f  strict_14_with_grace_period   
3                        f  strict_14_with_grace_period   
4                        f  strict_14_with_grace_period   
5                        f  strict_14_with_grace_period   
6                        f  strict_14_with_grace_period   
7                        f                     flexible   
8                        f              super_strict_30   
9                        f                     flexible   

  require_guest_profile_picture require_guest_phone_verification  \
0                             f                                f   
1                             f                                f   
2                             f                                f   
3                             f                                f   
4                             f                                f   
5                             f                                f   
6                             f                                f   
7                             f                                f   
8                             f                                f   
9                             f                                f   

   calculated_host_listings_count  \
0                               1   
1                               6   
2                               6   
3                               8   
4                               8   
5                               9   
6                               9   
7                               1   
8                              39   
9                               6   

   calculated_host_listings_count_entire_homes  \
0                                            1   
1                                            6   
2                                            6   
3                                            8   
4                                            8   
5                                            9   
6                                            9   
7                                            0   
8                                           22   
9                                            1   

  calculated_host_listings_count_private_rooms  \
0                                            0   
1                                            0   
2                                            0   
3                                            0   
4                                            0   
5                                            0   
6                                            0   
7                                            1   
8                                           17   
9                                            5   

   calculated_host_listings_count_shared_rooms  reviews_per_month  
0                                            0               0.28  
1                                            0               0.79  
2                                            0               0.88  
3                                            0               0.35  
4                                            0               0.24  
5                                            0               2.74  
6                                            0               0.04  
7                                            0                NaN  
8                                            0               0.43  
9                                            0               0.17  

[10 rows x 106 columns]
In [15]:
# load review data
reviews = pd.read_csv('reviews.csv')
print(reviews.head(10))
   listing_id         id        date  reviewer_id reviewer_name  \
0        3781   37776825  2015-07-10     36059247          Greg   
1        3781   41842494  2015-08-09     10459388           Tai   
2        3781   45282151  2015-09-01     12264652        Damien   
3        3781   49022647  2015-09-30     41426327          Mike   
4        3781   52503327  2015-10-30     15151513          Ivan   
5        3781   63371931  2016-02-22     55456499     John-Mark   
6        3781   67909504  2016-04-01     55456499     John-Mark   
7        3781   97906006  2016-08-28     88528884         Elisa   
8        3781  105143774  2016-09-30      1342806        Nicole   
9        3781  129692749  2017-02-01     55126634        Carlos   

                                            comments  
0  The apartment was as advertised and Frank was ...  
1  It was a pleasure to stay at Frank's place. Th...  
2  The apartment description is entirely faithful...  
3  Thoroughly enjoyed my time at Frank's home. Ha...  
4  Great value for the money! This location has e...  
5  Frank was very accommodating throughout my sta...  
6  Excellent! This was my second stay at the Jeff...  
7  Frank was a wonderful and accommodating host. ...  
8  Frank was great, the apartment has everything ...  
9  Very nice. Comfortable apartment. Good locatio...  

Sentiment Analysis on Review Data

To understand how a reviewer feels about an Airbnb listing, we cannot rely exclusively on surface attributes of the text review, such as its length or how frequently the listing is reviewed. To get a better picture of the reviewer's opinion, we run a basic sentiment analysis over the text of each comment.

We use VADER (the vaderSentiment package) for this purpose, a popular and simple sentiment analysis package for Python.

In [65]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def determine_sentiment(row):
    """Label a review 'pos', 'neg', or 'neu' from its VADER scores."""
    try:
        scores = analyzer.polarity_scores(row['comments'])
    except Exception:
        # missing or non-text comments are treated as neutral
        return 'neu'
    # the positive check runs first, so it wins when both thresholds are exceeded
    if scores['pos'] > .3:
        return 'pos'
    if scores['neg'] > .1:
        return 'neg'
    return 'neu'

reviews['sentiment'] = reviews.apply(determine_sentiment, axis=1)

print(reviews.head(100))
    listing_id         id        date  reviewer_id reviewer_name  \
0         3781   37776825  2015-07-10     36059247          Greg   
1         3781   41842494  2015-08-09     10459388           Tai   
2         3781   45282151  2015-09-01     12264652        Damien   
3         3781   49022647  2015-09-30     41426327          Mike   
4         3781   52503327  2015-10-30     15151513          Ivan   
..         ...        ...         ...          ...           ...   
95        5506  350194035  2018-11-18     58561712        Ashley   
96        5506  413964079  2019-02-18     18892960      Chankrit   
97        5506  425115071  2019-03-17    138063728          Mick   
98        5506  430450382  2019-03-30     16538334          Dean   
99        5506  434815683  2019-04-08     24150756        Rachel   

                                             comments sentiment  
0   The apartment was as advertised and Frank was ...       pos  
1   It was a pleasure to stay at Frank's place. Th...       neu  
2   The apartment description is entirely faithful...       neu  
3   Thoroughly enjoyed my time at Frank's home. Ha...       pos  
4   Great value for the money! This location has e...       neu  
..                                                ...       ...  
95  Awesome room at an unbeatable price! The only ...       pos  
96  The place was awesome for the price!! MUST sta...       pos  
97  Could not have chosen a better spot for my fir...       neu  
98  Easy to find. Just look for the intersection o...       neu  
99  The studio is quiet and charming, conveniently...       pos  

[100 rows x 7 columns]
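
The labeling rule above depends on the order of its checks and ignores VADER's combined "compound" score. The sketch below isolates the rule and compares it with an alternative based on the compound score, using the commonly suggested cutoffs of +0.05 and -0.05. The two score dictionaries are hypothetical values written by hand in the shape that polarity_scores() returns; they are not output from real reviews, and the function names are made up for this illustration.

```python
def label_like_tutorial(scores):
    """Same rule as determine_sentiment: check 'pos' first, then 'neg'."""
    if scores["pos"] > 0.3:
        return "pos"
    if scores["neg"] > 0.1:
        return "neg"
    return "neu"


def label_by_compound(scores, threshold=0.05):
    """Alternative rule based on the normalized 'compound' score."""
    if scores["compound"] >= threshold:
        return "pos"
    if scores["compound"] <= -threshold:
        return "neg"
    return "neu"


# Hypothetical score dictionaries (assumed values, for illustration only).
mixed = {"neg": 0.25, "neu": 0.40, "pos": 0.35, "compound": -0.20}
mild = {"neg": 0.00, "neu": 0.80, "pos": 0.20, "compound": 0.60}

print(label_like_tutorial(mixed), label_by_compound(mixed))
print(label_like_tutorial(mild), label_by_compound(mild))
```

The two rules can disagree on mixed or mildly worded reviews, so the choice of rule affects how many reviews end up labeled 'neu'.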

Data cleaning

My pipeline analyzes the Airbnb calendar data for Boston from December 11 through December 29, 2019, and calculates the average listing price for each day in that range.

First, we select only the columns we care about: the listing id, the date of the listing, and the price. Then we need to convert the date to a datetime type, and filter for only listings in our desired time range.

We then need to format the price data, removing dollar signs and commas from the price information. Finally, we group by the date of the listing and calculate the mean price for each date.
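
The cleaning steps described above can be sketched on a tiny frame. The rows below are made up for illustration, and the vectorized str.replace call is one possible way to strip the currency formatting, not necessarily the approach used in the next cell.

```python
import pandas as pd

# Illustrative rows only; the real data comes from calendar.csv.
toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "date": ["2019-12-11", "2019-12-12", "2019-12-11", "2019-12-12"],
    "price": ["$79.00", "$81.00", "$1,250.00", "$1,000.00"],
})

# Parse dates so they can be compared and grouped as timestamps.
toy["date"] = pd.to_datetime(toy["date"])

# Remove "$" and "," in one pass, then convert the strings to floats.
toy["price"] = toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Mean price across listings for each date.
daily_mean = toy.groupby("date")["price"].mean()
print(daily_mean)
```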

In [16]:
# select only the columns we care about: the listing id, the date of the listing, and the price.
df = df[['listing_id', 'date', 'price']].copy()  # copy so later column assignments do not warn

# convert the date to a datetime type, and filter for only listings in our desired time range.
df["date"] = pd.to_datetime(df["date"])
df = df.loc[((df['date'] > '2019-12-10') & (df['date'] < '2019-12-30'))]

# strip dollar signs and commas from price info, and convert to float from string
df['price'] = df['price'].apply(lambda price: float(price.strip("$").replace(',', '')))

# group by date and calculate the mean price
dates = df.groupby("date")["price"].mean().reset_index()
dates['date'] = dates['date'].dt.strftime("%d")

# print a sample
print(dates.head())
  date       price
0   11  142.646707
1   12  141.661819
2   13  147.092957
3   14  148.628743
4   15  137.102082
In [21]:
# aggregate the calendar data by listing id to compute the mean price per listing
prices = df.copy()
prices = prices.groupby("listing_id")["price"].mean().reset_index()

print(prices.head())
   listing_id       price
0        5506   79.000000
1        6695   99.000000
2        8789   85.210526
3       10730  100.000000
4       10811   84.000000
In [24]:
# compute mean price per listing and mean reviews per month per listing (merged in the next cell)
prices = df.copy()
prices = prices.groupby("listing_id")["price"].mean().reset_index()

listings = listings.groupby("id")["reviews_per_month"].mean().reset_index()
print(listings.head())
      id  reviews_per_month
0   3781               0.28
1   5506               0.79
2   6695               0.88
3   8789               0.35
4  10730               0.24
In [25]:
# merge 2 dataframes
price_vs_reviews = prices.merge(listings, left_on='listing_id', right_on='id')
print(price_vs_reviews.head())
   listing_id       price     id  reviews_per_month
0        5506   79.000000   5506               0.79
1        6695   99.000000   6695               0.88
2        8789   85.210526   8789               0.35
3       10730  100.000000  10730               0.24
4       10813   94.210526  10813               2.74
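
The default merge is an inner join, so listings that appear in only one frame drop out silently (in the samples above, 10811 has a price but no match, and 3781 has reviews but no price). A minimal sketch for auditing this uses an outer join with indicator=True. The frames below are small stand-ins built from ids in the printed samples; they are illustrative, not the real data.

```python
import pandas as pd

# Toy stand-ins for the `prices` and `listings` frames.
toy_prices = pd.DataFrame({
    "listing_id": [5506, 6695, 10811],
    "price": [79.0, 99.0, 84.0],
})
toy_listings = pd.DataFrame({
    "id": [3781, 5506, 6695],
    "reviews_per_month": [0.28, 0.79, 0.88],
})

# An outer join keeps every row; the _merge column records where each came from.
audit = toy_prices.merge(toy_listings, left_on="listing_id", right_on="id",
                         how="outer", indicator=True)
print(audit["_merge"].value_counts())

# Rows present in both frames are the ones an inner join would keep.
matched = audit[audit["_merge"] == "both"]
```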
In [57]:
# merge prices with reviews to get price vs sentiment analysis
price_vs_sentiment = prices.merge(reviews, on='listing_id')
# keep only the rows with index labels 0 through 1000
price_vs_sentiment = price_vs_sentiment.truncate(after=1000)
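
Because the merged frame is ordered by listing, truncate(after=1000) keeps reviews from only the first few listings. The sketch below contrasts that with a random sample on a toy frame (the values are made up). Sampling is offered as a possible alternative, not as what this analysis does.

```python
import pandas as pd

# Toy frame ordered by listing, like the merged frame above.
toy = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 3],
    "price": [10.0, 10.0, 10.0, 20.0, 20.0, 30.0],
})

# truncate is label-based and inclusive: index labels 0, 1 and 2 are kept.
head_rows = toy.truncate(after=2)

# sample draws rows from the whole frame; random_state makes it repeatable.
sampled = toy.sample(n=3, random_state=0)

print(head_rows["listing_id"].unique())
print(sorted(sampled["listing_id"].unique()))
```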
In [49]:
print(price_vs_sentiment)
        listing_id      price         id        date  reviewer_id  \
0             5506  79.000000       1021  2009-03-21         8903   
1             5506  79.000000       1953  2009-05-01        12970   
2             5506  79.000000      10089  2009-09-14        14033   
3             5506  79.000000      18678  2009-11-29        51993   
4             5506  79.000000      24467  2010-01-19        68674   
...            ...        ...        ...         ...          ...   
135168    40540267  49.947368  587029500  2020-01-02     59248540   
135169    40540267  49.947368  595019726  2020-01-19    325666525   
135170    40540267  49.947368  606749708  2020-02-16     89426649   
135171    40540267  49.947368  613108010  2020-03-01    151706284   
135172    40540267  49.947368  614289516  2020-03-04    114593428   

       reviewer_name                                           comments  \
0              Jenny  Terry's Hotel Alterntv in Boston was a perfect...   
1              Clint  Warm and accommodating host. Beautiful and wel...   
2            Nicolas  We stay 3 nights at the Hotel Alternative and ...   
3              Sarah  We stayed in the private room.  Terry provided...   
4             LaJuan  Stayed here recently during a conference and t...   
...              ...                                                ...   
135168         Micah  Enjoyed the stay! Turana was quick to accept o...   
135169         Julia  Nice stay, available on short notice. Host rep...   
135170         Daria  Great location with awesome food options close...   
135171       Timothy  It's a nice place.  There are 5 rooms, but wit...   
135172         Harry  Overall this is worth the value and the locati...   

       sentiment  
0            pos  
1            pos  
2            pos  
3            neu  
4            neu  
...          ...  
135168       pos  
135169       neu  
135170       pos  
135171       neu  
135172       neu  

[135173 rows x 8 columns]

Exploratory Data Analysis

1) This plot shows the average listing price for each day in the selected date range, and therefore indicates which nights are cheapest to stay in an Airbnb.

In [29]:
from plotnine import ggplot, aes, geom_bar, labs, ggtitle
(ggplot(dates, aes(x='date', y='price')) +
 geom_bar(stat='identity') +
 labs(y="Price ($)", x = "Date in December 2019") +
 ggtitle("Average Airbnb Listing Price in Boston 2019"))
Out[29]:
<ggplot: (301645681)>
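
The cheapest and most expensive days can also be found numerically rather than by eye. The sketch below uses only the five rows of `dates` printed earlier, so its answer covers December 11 through 15 and not the full range; on the real frame the same two lines would apply to `dates` directly.

```python
import pandas as pd

# First five rows of the `dates` sample printed earlier (not the full range).
dates_sample = pd.DataFrame({
    "date": ["11", "12", "13", "14", "15"],
    "price": [142.646707, 141.661819, 147.092957, 148.628743, 137.102082],
})

# idxmin / idxmax return the index label of the lowest / highest price.
cheapest = dates_sample.loc[dates_sample["price"].idxmin()]
priciest = dates_sample.loc[dates_sample["price"].idxmax()]
print(cheapest["date"], priciest["date"])
```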

2) In this next plot, we visualize how the price of a listing relates to the number of reviews per month it receives.

In [31]:
import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.scatterplot(x="price", y="reviews_per_month", data=price_vs_reviews)
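
A scatterplot can be complemented with a single number for the strength of the linear relationship. The sketch below computes a Pearson correlation with Series.corr on made-up values (including one missing reviews_per_month, as occurs in the listings data); it illustrates the call and says nothing about the real correlation.

```python
import pandas as pd

# Made-up values; the None becomes NaN, like listings with no reviews.
toy = pd.DataFrame({
    "price": [50.0, 100.0, 150.0, 200.0, 120.0],
    "reviews_per_month": [4.0, 3.0, 2.0, 1.0, None],
})

# corr() skips rows where either value is missing.
r = toy["price"].corr(toy["reviews_per_month"])
n_used = toy[["price", "reviews_per_month"]].dropna().shape[0]
print(round(r, 3), n_used)
```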

3) Finally, we plot each review by its date against the average price of the reviewed listing, colored by review sentiment, to see whether sentiment differs across price levels.

In [58]:
ax = sns.scatterplot(x="date", y="price", hue="sentiment", data=price_vs_sentiment)
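
A numeric summary can make the price and sentiment comparison easier to read than a dense scatterplot. The sketch below groups a toy frame by sentiment label and reports the count, mean, and median price per label. The rows are invented for illustration; on the real data the same groupby would be applied to price_vs_sentiment.

```python
import pandas as pd

# Invented rows with the same two columns as price_vs_sentiment.
toy = pd.DataFrame({
    "price": [79.0, 79.0, 99.0, 250.0, 250.0, 49.0],
    "sentiment": ["pos", "neu", "pos", "neg", "pos", "neu"],
})

# One row per sentiment label with several price statistics.
summary = toy.groupby("sentiment")["price"].agg(["count", "mean", "median"])
print(summary)
```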

Hypothesis testing and machine learning

Previously, in the exploratory data analysis section, we visualized the relationship between an Airbnb listing's price and the number of reviews per month it receives. Here we fit an ordinary least squares (OLS) linear regression to quantify that relationship. Note that the formula 'price~reviews_per_month' treats price as the dependent variable, so the model estimates a listing's price from its reviews per month.

In [67]:
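import statsmodels.formula.api as smf

# scatterplot with a fitted regression line
sns.regplot(x="price", y="reviews_per_month", data=price_vs_reviews);

# OLS with price as the dependent variable and reviews_per_month as the predictor
simple_res = smf.ols('price ~ reviews_per_month', data=price_vs_reviews).fit()
simple_res.summary()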
Out[67]:
OLS Regression Results
Dep. Variable:     price              R-squared:            0.012
Model:             OLS                Adj. R-squared:       0.012
Method:            Least Squares      F-statistic:          30.89
Date:              Mon, 18 May 2020   Prob (F-statistic):   3.03e-08
Time:              23:07:57           Log-Likelihood:       -15680.
No. Observations:  2474               AIC:                  3.136e+04
Df Residuals:      2472               BIC:                  3.138e+04
Df Model:          1
Covariance Type:   nonrobust

                       coef  std err       t  P>|t|   [0.025   0.975]
Intercept          155.4425    3.830  40.591  0.000  147.933  162.952
reviews_per_month   -7.9908    1.438  -5.558  0.000  -10.810   -5.171

Omnibus:         4286.258   Durbin-Watson:      1.889
Prob(Omnibus):   0.000      Jarque-Bera (JB):   7661946.565
Skew:            11.558     Prob(JB):           0.00
Kurtosis:        274.649    Cond. No.           3.98


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

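Thus, we have fit a linear regression model to the data. The model estimates that each additional review per month is associated with a listing price about $7.99 lower (p < 0.001). However, with an R-squared of 0.012, reviews per month explain only about 1% of the variation in price, so the relationship is statistically significant but weak.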
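
To make the fitted line concrete, the sketch below plugs the two coefficients from the summary table into the regression equation, price = intercept + slope * reviews_per_month. The function name and the example inputs are made up for illustration; given the low R-squared, these values describe the average trend and are not reliable predictions for individual listings.

```python
# Coefficients copied from the OLS summary table above.
INTERCEPT = 155.4425
SLOPE = -7.9908


def fitted_price(reviews_per_month):
    """Price on the fitted regression line for a given review rate."""
    return INTERCEPT + SLOPE * reviews_per_month


# Example review rates chosen arbitrarily for illustration.
for rpm in (0.0, 1.0, 3.0):
    print(rpm, round(fitted_price(rpm), 2))
```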

Conclusion and further reading

The goal of the analysis was to study how prices fluctuate among Airbnb listings in Boston, and to determine whether there is a relationship between the price of a listing and the reviews it receives. We have examined how listing prices fluctuate over the course of a month, how review sentiment varies with listing price, and how price relates to the number of reviews per month in a linear regression, where the association is negative but weak (R-squared of 0.012).

Recommended further reading on this topic:

1) Airbnb price analysis in TowardsDataScience: https://towardsdatascience.com/predicting-airbnb-prices-with-machine-learning-and-deep-learning-f46d44afb8a6

2) Pandas: http://pandas.pydata.org/pandas-docs/stable/

3) Seaborn: https://seaborn.pydata.org/

4) Statsmodels: http://www.statsmodels.org/stable/index.html

5) Airbnb datasets: http://insideairbnb.com/get-the-data.html