Airbnb is a rapidly growing platform that connects travelers worldwide with unique places to stay. Like any large company, Airbnb relies on data analytics for a wide variety of purposes, including measuring business performance, marketing, and making predictions. Data analytics are especially valuable in this sector because companies like Airbnb use the data to improve their customers' experiences and grow their markets. The same data also helps Airbnb users plan their trips: understanding a city's analytics can help users decide which part of the city to explore, budget their living expenses, and gauge the reliability of listings and hosts based on reviews.
In this tutorial, we will use several R libraries and services to analyze Airbnb data. Specifically, we will analyze the listings in Washington D.C. By breaking this data down, we will be able to answer questions and recognize trends and patterns relating to Airbnb listings, working through data preparation, exploratory analysis, interactive mapping, and regression analysis.
First, we need to install the necessary libraries and packages for this tutorial. To make this project replicable, these libraries are public and work with R version 3.4.4 running in RStudio. Each package is installed with
install.packages("packageName")
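For example, to install everything this tutorial loads in a single call (the package names match the library() calls below; tidyverse bundles ggplot2, dplyr, stringr, and friends):
install.packages(c("tidyverse", "leaflet", "broom", "rvest"))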
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(stringr)
library(leaflet)
library(broom)
require(rvest)
## Loading required package: rvest
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
Step one of the data pipeline starts here. Airbnb listing data is shared publicly for analytics purposes. These datasets are split by city of travel and cover the listings available in that city, an extensive breakdown of the reviews for each listing, an overview of the neighborhoods, and much more. For our tutorial, we decided to look further into the Airbnb options for Washington D.C. travelers. We downloaded the most recent Washington D.C. listings.csv file, last updated on 04/24/2020. We prepare the dataset by loading the file and observing the nature of the data.
Specifically, the listings dataset gives us access to the following information: host name, host ID, name of the listing, neighborhood of the listing, exact latitude and longitude of the listing, type of room, price per night, minimum number of nights required, number of reviews for the listing, date of the last review, number of reviews written per month, number of listings per host, number of days per year the listing is available, and a room type ID.
We obtained our dataset from the following website: http://insideairbnb.com/get-the-data.html.
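If you prefer to script the download rather than click through the site, base R's download.file() can fetch the file directly. A sketch follows; the URL is only illustrative of Inside Airbnb's usual file layout, so copy the current listings.csv link from the page above before running:
# NOTE: illustrative URL - verify the current link on insideairbnb.com
download.file("http://data.insideairbnb.com/united-states/dc/washington-dc/2020-04-24/visualisations/listings.csv",
              destfile = "listings_a.csv")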
# We will import the data with the "read_csv" function.
listings <- read_csv("C:/Users/ishaa/Downloads/listings_a.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## name = col_character(),
## host_id = col_double(),
## host_name = col_character(),
## neighbourhood_group = col_logical(),
## neighbourhood = col_character(),
## latitude = col_double(),
## longitude = col_double(),
## room_type = col_character(),
## price = col_double(),
## minimum_nights = col_double(),
## number_of_reviews = col_double(),
## last_review = col_character(),
## reviews_per_month = col_double(),
## calculated_host_listings_count = col_double(),
## availability_365 = col_double(),
## roomtype_id = col_double()
## )
listings
## # A tibble: 9,342 x 17
## id name host_id host_name neighbourhood_g~ neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <lgl> <chr> <dbl>
## 1 3362 Conv~ 2798 Ayeh NA Shaw, Logan ~ 38.9
## 2 3663 Clas~ 4617 Shawn & ~ NA Brightwood P~ 39.0
## 3 3670 Beau~ 4630 Sheila NA Howard Unive~ 38.9
## 4 3686 Vita~ 4645 Vita NA Historic Ana~ 38.9
## 5 3771 Mt. ~ 4795 Charlene NA Columbia Hei~ 38.9
## 6 4002 2 Be~ 5143 Anthony NA North Michig~ 38.9
## 7 4197 Bedr~ 5061 Sandra NA Capitol Hill~ 38.9
## 8 4501 DC R~ 1585 Kip NA Shaw, Logan ~ 38.9
## 9 4529 Bert~ 5803 Bertina'~ NA Eastland Gar~ 38.9
## 10 4967 DC, ~ 7086 Seveer NA Ivy City, Ar~ 38.9
## # ... with 9,332 more rows, and 10 more variables: longitude <dbl>,
## # room_type <chr>, price <dbl>, minimum_nights <dbl>,
## # number_of_reviews <dbl>, last_review <chr>, reviews_per_month <dbl>,
## # calculated_host_listings_count <dbl>, availability_365 <dbl>,
## # roomtype_id <dbl>
As we can see, there is a lot of data to parse through in the loaded listings.csv file. Therefore, we want to clean and tidy the data and keep only the items useful for this tutorial. Tidying data is the process of structuring a dataset so that it is easy to analyze and manipulate. With a tidy dataset we can operate on the values directly, for example by adding new columns for means or standardized scores, or by transforming the individual values of certain columns. We select only the columns we are interested in, which also drops the neighbourhood_group column that contains nothing but NA entries (as the drop_na() check below shows, dropping every row with any NA would leave us with no data at all). We will name the resulting table 'listings_data' and rename the columns to a cleaner format: "Host_ID", "Neighborhood", "Latitude", "Longitude", "Room_Type", "Price", "Minimum_Nights", "Number_of_Reviews", "Reviews_per_Month", "Host_Listings_Count", "Yearly_Availability", "Room_Type_ID".
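To make that kind of column-wise manipulation concrete, here is a small aside (a sketch, not part of the pipeline below) that adds a standardized price column with mutate():
# Add a z-scored price column; na.rm = TRUE guards against missing values
listings %>%
  mutate(price_z = (price - mean(price, na.rm = TRUE)) / sd(price, na.rm = TRUE)) %>%
  select(id, price, price_z)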
# Dropping every row with an NA would empty the table, since the
# neighbourhood_group column is entirely NA; we run this check without
# assigning its result.
listings %>%
  drop_na()
## # A tibble: 0 x 17
## # ... with 17 variables: id <dbl>, name <chr>, host_id <dbl>, host_name <chr>,
## # neighbourhood_group <lgl>, neighbourhood <chr>, latitude <dbl>,
## # longitude <dbl>, room_type <chr>, price <dbl>, minimum_nights <dbl>,
## # number_of_reviews <dbl>, last_review <chr>, reviews_per_month <dbl>,
## # calculated_host_listings_count <dbl>, availability_365 <dbl>,
## # roomtype_id <dbl>
head(listings)
## # A tibble: 6 x 17
## id name host_id host_name neighbourhood_g~ neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <lgl> <chr> <dbl>
## 1 3362 Conv~ 2798 Ayeh NA Shaw, Logan ~ 38.9
## 2 3663 Clas~ 4617 Shawn & ~ NA Brightwood P~ 39.0
## 3 3670 Beau~ 4630 Sheila NA Howard Unive~ 38.9
## 4 3686 Vita~ 4645 Vita NA Historic Ana~ 38.9
## 5 3771 Mt. ~ 4795 Charlene NA Columbia Hei~ 38.9
## 6 4002 2 Be~ 5143 Anthony NA North Michig~ 38.9
## # ... with 10 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <chr>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>, roomtype_id <dbl>
listings_data <- select(listings, host_id, neighbourhood, latitude, longitude,room_type, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365, roomtype_id)
listings_data
## # A tibble: 9,342 x 12
## host_id neighbourhood latitude longitude room_type price minimum_nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2798 Shaw, Logan ~ 38.9 -77.0 Entire h~ 433 2
## 2 4617 Brightwood P~ 39.0 -77.0 Entire h~ 154 3
## 3 4630 Howard Unive~ 38.9 -77.0 Private ~ 75 2
## 4 4645 Historic Ana~ 38.9 -77.0 Private ~ 55 2
## 5 4795 Columbia Hei~ 38.9 -77.0 Private ~ 88 1
## 6 5143 North Michig~ 38.9 -77.0 Entire h~ 120 2
## 7 5061 Capitol Hill~ 38.9 -77.0 Private ~ 83 7
## 8 1585 Shaw, Logan ~ 38.9 -77.0 Private ~ 475 2
## 9 5803 Eastland Gar~ 38.9 -76.9 Private ~ 52 30
## 10 7086 Ivy City, Ar~ 38.9 -77.0 Private ~ 99 2
## # ... with 9,332 more rows, and 5 more variables: number_of_reviews <dbl>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>, roomtype_id <dbl>
head(listings_data)
## # A tibble: 6 x 12
## host_id neighbourhood latitude longitude room_type price minimum_nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2798 Shaw, Logan ~ 38.9 -77.0 Entire h~ 433 2
## 2 4617 Brightwood P~ 39.0 -77.0 Entire h~ 154 3
## 3 4630 Howard Unive~ 38.9 -77.0 Private ~ 75 2
## 4 4645 Historic Ana~ 38.9 -77.0 Private ~ 55 2
## 5 4795 Columbia Hei~ 38.9 -77.0 Private ~ 88 1
## 6 5143 North Michig~ 38.9 -77.0 Entire h~ 120 2
## # ... with 5 more variables: number_of_reviews <dbl>, reviews_per_month <dbl>,
## # calculated_host_listings_count <dbl>, availability_365 <dbl>,
## # roomtype_id <dbl>
colnames(listings_data) = c("Host_ID",
"Neighborhood",
"Latitude",
"Longitude",
"Room_Type",
"Price",
"Minimum_Nights",
"Number_of_Reviews",
"Reviews_per_Month",
"Host_Listings_Count",
"Yearly_Availability",
"Room_Type_ID")
head(listings_data)
## # A tibble: 6 x 12
## Host_ID Neighborhood Latitude Longitude Room_Type Price Minimum_Nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2798 Shaw, Logan~ 38.9 -77.0 Entire h~ 433 2
## 2 4617 Brightwood ~ 39.0 -77.0 Entire h~ 154 3
## 3 4630 Howard Univ~ 38.9 -77.0 Private ~ 75 2
## 4 4645 Historic An~ 38.9 -77.0 Private ~ 55 2
## 5 4795 Columbia He~ 38.9 -77.0 Private ~ 88 1
## 6 5143 North Michi~ 38.9 -77.0 Entire h~ 120 2
## # ... with 5 more variables: Number_of_Reviews <dbl>, Reviews_per_Month <dbl>,
## # Host_Listings_Count <dbl>, Yearly_Availability <dbl>, Room_Type_ID <dbl>
Great, the data has been cleaned and organized! The key features we will focus on are the properties renamed above for each listing: host, neighborhood, coordinates, room type, price, minimum nights, review activity, listings per host, and yearly availability.
Now, we have a tidy data table to work with for the rest of the tutorial.
With the data in a cleaner format, it is easier to parse out the specific information we want to analyze. We can now group the data to find trends and plot it to visualize patterns and perform analytics. First, let's see how many records exist in listings_data.
nrow(listings_data)
## [1] 9342
There are a total of 9,342 entries in our dataset (number of listings in Washington D.C.). Next, we want to see how many distinct neighborhoods there are in the D.C. area.
count(distinct(listings_data, Neighborhood))
## # A tibble: 1 x 1
## n
## <int>
## 1 39
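An equivalent, slightly more idiomatic one-liner uses dplyr's n_distinct(), which should return the same 39:
n_distinct(listings_data$Neighborhood)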
Given that there are 39 different neighborhoods, let's take a look at the distribution of listings across the neighborhoods of the D.C. area.
p <- ggplot(data = listings_data) +
  geom_bar(aes(Neighborhood, fill = Neighborhood), alpha = 0.85, width = .5) +
  theme_minimal(base_size = 8) + xlab("") + ylab("") + theme(legend.position = "none") +
  ggtitle("The Number of Listings in Each Area") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
  coord_flip()
p
As we can see, there is huge variation in the number of listings per neighborhood. The Union Station neighborhood has the most listings, at about 1,000. Columbia Heights has the second most and Capitol Hill the third. Most neighborhoods fall in the range of roughly 125 to 500 listings. So for any user, there are plenty of options to choose from when traveling to Washington D.C.
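To read off exact counts rather than estimating them from the bars, a quick tabulation with dplyr's count() (a sketch) lists the five largest neighborhoods:
# Tally listings per neighborhood, largest first, and keep the top five
listings_data %>%
  count(Neighborhood, sort = TRUE) %>%
  head(5)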
Second, it is helpful to see the price ranges of the Airbnb listings in the D.C. area. Understanding the price breakdown will help travelers with budgeting and help hosts decide how to price their own listings. We will visualize the price distribution of the D.C. listings using a cumulative distribution function.
p <- ggplot(data = listings_data, aes(Price)) +
stat_ecdf(geom = "step", color = '#fd5c63', lwd = 1.2) +
ylab("Proportion") + xlab("Price") + theme_minimal(base_size = 13) + xlim(0, 1000) +
ggtitle("The Cumulative Distribution of Listings Price")
p
The cumulative distribution plot indicates that about 80% of the Airbnb properties in Washington D.C. cost less than 250 dollars a night, and the median price is a little over 100 dollars a night.
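We can check these eyeballed figures exactly; quantile() reports the cut points directly (the median here should match the summary statistics shown later in the tutorial):
# The 50th and 80th percentiles of nightly price
quantile(listings_data$Price, probs = c(0.5, 0.8))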
Third, let's observe the kinds of rooms offered by the listings in each of the neighborhoods of D.C. There are four room types in the D.C. area: entire homes/apartments, private rooms, shared rooms, and hotel rooms. It is helpful to observe how the room types break down within each neighborhood.
p <- ggplot(data = listings_data) +
  geom_bar(aes(Neighborhood, fill = Room_Type), alpha = 0.85, position = "fill") +
  theme_minimal(base_size = 8) + xlab("") + ylab("") +
  ggtitle("The Proportion of Room Type in Each Area") + coord_flip()
p
It seems that the majority of the listings are entire homes/apartments or private rooms. The West End neighborhood has the highest proportion of entire home/apartment listings. The Woodland/Fort Stanton and Mayfair neighborhoods appear to have the same proportion of private room listings, and Woodland/Fort Stanton specifically has the highest proportion of shared room listings. There are not many hotel room listings in the D.C. area; it seems that only the Capitol Hill, Dupont Circle, Georgetown, and Near Southwest neighborhoods have hotel rooms as listings.
Let's take a closer look at the breakdown of room types. We want to see how many listings there are of each room type: how many are private rooms, shared rooms, hotel rooms, and entire homes/apartments. We can generate a table to show this data.
df <- listings_data %>%
group_by(Room_Type) %>%
mutate(Counts = n())
df
## # A tibble: 9,342 x 13
## # Groups: Room_Type [4]
## Host_ID Neighborhood Latitude Longitude Room_Type Price Minimum_Nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2798 Shaw, Logan~ 38.9 -77.0 Entire h~ 433 2
## 2 4617 Brightwood ~ 39.0 -77.0 Entire h~ 154 3
## 3 4630 Howard Univ~ 38.9 -77.0 Private ~ 75 2
## 4 4645 Historic An~ 38.9 -77.0 Private ~ 55 2
## 5 4795 Columbia He~ 38.9 -77.0 Private ~ 88 1
## 6 5143 North Michi~ 38.9 -77.0 Entire h~ 120 2
## 7 5061 Capitol Hil~ 38.9 -77.0 Private ~ 83 7
## 8 1585 Shaw, Logan~ 38.9 -77.0 Private ~ 475 2
## 9 5803 Eastland Ga~ 38.9 -76.9 Private ~ 52 30
## 10 7086 Ivy City, A~ 38.9 -77.0 Private ~ 99 2
## # ... with 9,332 more rows, and 6 more variables: Number_of_Reviews <dbl>,
## # Reviews_per_Month <dbl>, Host_Listings_Count <dbl>,
## # Yearly_Availability <dbl>, Room_Type_ID <dbl>, Counts <int>
# Keep one representative row per room type (dedupe on the Counts column)
roomTypes <- df[!duplicated(df$Counts),]
roomTypes
## # A tibble: 4 x 13
## # Groups: Room_Type [4]
## Host_ID Neighborhood Latitude Longitude Room_Type Price Minimum_Nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2.80e3 Shaw, Logan~ 38.9 -77.0 Entire h~ 433 2
## 2 4.63e3 Howard Univ~ 38.9 -77.0 Private ~ 75 2
## 3 3.21e4 Dupont Circ~ 38.9 -77.0 Shared r~ 90 2
## 4 1.38e7 Georgetown,~ 38.9 -77.1 Hotel ro~ 185 2
## # ... with 6 more variables: Number_of_Reviews <dbl>, Reviews_per_Month <dbl>,
## # Host_Listings_Count <dbl>, Yearly_Availability <dbl>, Room_Type_ID <dbl>,
## # Counts <int>
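Note that deduplicating on Counts works here only because the four room types happen to have distinct totals. A version that avoids that assumption tallies the room types directly (a sketch):
# One row per room type with its listing count
listings_data %>%
  count(Room_Type, name = "Counts")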
We can visualize the total number of listings of each room type in a bar graph. This will help us gauge the variability of the options available to travelers and to Airbnb analytics.
p <-ggplot(data=roomTypes, aes(x=Room_Type, y=Counts)) +
geom_bar(stat="identity", color="#FF5A5F", fill="white")+
geom_text(aes(label=Counts), vjust=-0.3, size=3.5) +
theme_minimal() + ggtitle("Different Types of Rooms in Listings")
p
Based on this graph, we can see that we were right: entire home/apartment listings are the most common, with 6,553 listings, and hotel rooms are the least common, with only 46 listings total. Regardless, travelers have plenty of options to choose from to live comfortably during their travel to D.C.
Fourth, let’s see how many of these listings are unique based on Host ID. This will allow us to see whether or not a single host has multiple listings.
count(distinct(listings_data, Host_ID))
## # A tibble: 1 x 1
## n
## <int>
## 1 5988
As we can see, there are 5,988 unique hosts, which means some hosts have multiple listings in Washington D.C. With this information, we can determine which host has the most listings by sorting the data by Host_Listings_Count in descending order. We also remove duplicates so that we keep distinct Host ID values, and narrow the dataframe down to the top 10 hosts.
top_hosts <- listings_data[order(-listings_data$Host_Listings_Count),]
top_hosts <- top_hosts[!duplicated(top_hosts$Host_Listings_Count),]
top_hosts <- top_hosts[1:10,]
top_hosts
## # A tibble: 10 x 12
## Host_ID Neighborhood Latitude Longitude Room_Type Price Minimum_Nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 4.80e7 Dupont Circ~ 38.9 -77.0 Entire h~ 244 30
## 2 1.07e8 Howard Univ~ 38.9 -77.0 Entire h~ 175 30
## 3 1.60e7 Edgewood, B~ 38.9 -77.0 Private ~ 48 91
## 4 2.95e8 Capitol Hil~ 38.9 -77.0 Private ~ 49 30
## 5 4.66e7 Union Stati~ 38.9 -77.0 Entire h~ 99 3
## 6 3.99e7 Shaw, Logan~ 38.9 -77.0 Entire h~ 125 1
## 7 1.76e4 Spring Vall~ 38.9 -77.1 Entire h~ 99 3
## 8 8.01e6 Downtown, C~ 38.9 -77.0 Entire h~ 225 3
## 9 1.47e8 Capitol Hil~ 38.9 -77.0 Hotel ro~ 150 1
## 10 3.03e7 Downtown, C~ 38.9 -77.0 Entire h~ 219 2
## # ... with 5 more variables: Number_of_Reviews <dbl>, Reviews_per_Month <dbl>,
## # Host_Listings_Count <dbl>, Yearly_Availability <dbl>, Room_Type_ID <dbl>
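As with the room-type table, deduplicating on Host_Listings_Count assumes no two of the top hosts share the same count. A sketch keyed on Host_ID avoids that assumption:
# One row per host, sorted by listing count, top ten kept
listings_data %>%
  distinct(Host_ID, .keep_all = TRUE) %>%
  arrange(desc(Host_Listings_Count)) %>%
  head(10)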
The top 10 hosts are spread across different neighborhoods, but the Downtown neighborhood is home to two of the most active hosts. Let's look at how the number of listings varies across the top ten hosts in a bar graph. This will help us understand how the hosts and their listings vary.
top_hosts$Host_ID <- factor(top_hosts$Host_ID)
p <-ggplot(data=top_hosts, aes(x=Host_ID, y=Host_Listings_Count)) +
geom_bar(stat="identity", color="white", fill="#FF5A5F")+
geom_text(aes(label=Host_Listings_Count), vjust=-0.3, size=3.5) +
theme_minimal() + ggtitle("Number of Listings of Top 10 Hosts")
p
Host 48005494 has the highest number of listings at 272. Host 107434423 has the next most at 165. The lowest count among the top 10 hosts belongs to Host 30283594, with only 38 listings.
Now that we have a deeper understanding of the Airbnb listings data in the Washington D.C. area, we can continue to explore further. Let’s see how we can visualize some of our understandings on a map.
Next, we will map the Airbnb listings in Washington D.C. Specifically, we will plot the top 30 listings with the most reviews on a map of the city, along with their prices, so we can see how they vary by location.
First, we will create the dataframe that will be used in our map.
top_reviews <- listings_data[order(-listings_data$Number_of_Reviews),]
top_reviews <- top_reviews[1:30,]
top_reviews
## # A tibble: 30 x 12
## Host_ID Neighborhood Latitude Longitude Room_Type Price Minimum_Nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2.25e6 Union Stati~ 38.9 -77.0 Entire h~ 89 2
## 2 1.40e6 Union Stati~ 38.9 -77.0 Private ~ 85 2
## 3 1.40e6 Union Stati~ 38.9 -77.0 Entire h~ 99 2
## 4 2.97e4 Shaw, Logan~ 38.9 -77.0 Entire h~ 129 2
## 5 2.05e7 Edgewood, B~ 38.9 -77.0 Entire h~ 115 1
## 6 5.66e4 Downtown, C~ 38.9 -77.0 Private ~ 85 3
## 7 9.82e7 Howard Univ~ 38.9 -77.0 Entire h~ 79 1
## 8 1.45e7 Friendship ~ 39.0 -77.1 Entire h~ 80 1
## 9 2.04e6 Capitol Hil~ 38.9 -77.0 Entire h~ 188 3
## 10 3.67e6 Southwest E~ 38.9 -77.0 Private ~ 60 1
## # ... with 20 more rows, and 5 more variables: Number_of_Reviews <dbl>,
## # Reviews_per_Month <dbl>, Host_Listings_Count <dbl>,
## # Yearly_Availability <dbl>, Room_Type_ID <dbl>
Now, we can express these listings on a map!
pal <- colorNumeric(
palette = "viridis",
domain = top_reviews$Price
)
dc_reviews<- leaflet(top_reviews) %>%
addTiles() %>%
addCircles(color=~pal(Price)) %>%
addLegend("topright", pal = pal, values = ~Price, title = "Price", opacity = 1) %>%
addControl("Prices for Top 30 Listings in D.C.", position = "bottomleft") %>%
setView(lat=38.902, lng=-77.03, zoom=11.25)
## Assuming "Longitude" and "Latitude" are longitude and latitude, respectively
dc_reviews
Based on the map, we observe a trend that the further away we move from the center of Washington D.C., the prices tend to decrease for the listings.
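We can put a rough number on that visual trend. As a sketch, straight-line distance in degrees from the map's center (the setView() coordinates above) is a crude proxy for distance at city scale, and a negative correlation with price would support the observation:
# Distance of each top listing from the map center, correlated with price
center_lat <- 38.902
center_lng <- -77.03
dist_deg <- sqrt((top_reviews$Latitude - center_lat)^2 +
                 (top_reviews$Longitude - center_lng)^2)
cor(dist_deg, top_reviews$Price)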
Regression is used to predict future data based on the patterns in the data we already have. So far, we have counted listings per neighborhood, examined the room types of the listings, and seen how prices depend on location within D.C. One of our larger questions is how much price varies across neighborhoods in the D.C. area. Let's take a look at the price distribution for the top 5 neighborhoods. To do this, we first build a dataframe identifying the top 5 neighborhoods by listing count.
df <- listings_data %>%
group_by(Neighborhood) %>%
mutate(Counts = n())
df
## # A tibble: 9,342 x 13
## # Groups: Neighborhood [39]
## Host_ID Neighborhood Latitude Longitude Room_Type Price Minimum_Nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2798 Shaw, Logan~ 38.9 -77.0 Entire h~ 433 2
## 2 4617 Brightwood ~ 39.0 -77.0 Entire h~ 154 3
## 3 4630 Howard Univ~ 38.9 -77.0 Private ~ 75 2
## 4 4645 Historic An~ 38.9 -77.0 Private ~ 55 2
## 5 4795 Columbia He~ 38.9 -77.0 Private ~ 88 1
## 6 5143 North Michi~ 38.9 -77.0 Entire h~ 120 2
## 7 5061 Capitol Hil~ 38.9 -77.0 Private ~ 83 7
## 8 1585 Shaw, Logan~ 38.9 -77.0 Private ~ 475 2
## 9 5803 Eastland Ga~ 38.9 -76.9 Private ~ 52 30
## 10 7086 Ivy City, A~ 38.9 -77.0 Private ~ 99 2
## # ... with 9,332 more rows, and 6 more variables: Number_of_Reviews <dbl>,
## # Reviews_per_Month <dbl>, Host_Listings_Count <dbl>,
## # Yearly_Availability <dbl>, Room_Type_ID <dbl>, Counts <int>
# Keep one representative row per neighborhood (dedupe on the Counts column)
new_data <- df[!duplicated(df$Counts),]
top_neighborhoods <- new_data[order(-new_data$Counts),]
top_neighborhoods <- top_neighborhoods[1:5,]
top_neighborhoods
## # A tibble: 5 x 13
## # Groups: Neighborhood [5]
## Host_ID Neighborhood Latitude Longitude Room_Type Price Minimum_Nights
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 64814 Union Stati~ 38.9 -77.0 Entire h~ 87 30
## 2 4795 Columbia He~ 38.9 -77.0 Private ~ 88 1
## 3 5061 Capitol Hil~ 38.9 -77.0 Private ~ 83 7
## 4 32067 Dupont Circ~ 38.9 -77.0 Entire h~ 195 2
## 5 315148 Edgewood, B~ 38.9 -77.0 Private ~ 35 28
## # ... with 6 more variables: Number_of_Reviews <dbl>, Reviews_per_Month <dbl>,
## # Host_Listings_Count <dbl>, Yearly_Availability <dbl>, Room_Type_ID <dbl>,
## # Counts <int>
Then, we will filter the data down to the top 5 neighborhoods we just found.
new_data <- listings_data[listings_data$Neighborhood %in%
  c("Union Station, Stanton Park, Kingman Park",
    "Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View",
    "Capitol Hill, Lincoln Park",
    "Dupont Circle, Connecticut Avenue/K Street",
    "Edgewood, Bloomingdale, Truxton Circle, Eckington"), ]
new_data <- select(new_data, Neighborhood, Price)
new_data
## # A tibble: 4,145 x 2
## Neighborhood Price
## <chr> <dbl>
## 1 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View 88
## 2 Capitol Hill, Lincoln Park 83
## 3 Dupont Circle, Connecticut Avenue/K Street 195
## 4 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View 125
## 5 Union Station, Stanton Park, Kingman Park 87
## 6 Dupont Circle, Connecticut Avenue/K Street 90
## 7 Union Station, Stanton Park, Kingman Park 645
## 8 Dupont Circle, Connecticut Avenue/K Street 195
## 9 Edgewood, Bloomingdale, Truxton Circle, Eckington 35
## 10 Columbia Heights, Mt. Pleasant, Pleasant Plains, Park View 99
## # ... with 4,135 more rows
At this point, we have all of the listings in the top 5 neighborhoods together with their respective prices. Next, we want to see how prices are distributed across those neighborhoods, so we create a violin plot (with an overlaid boxplot) of the price distribution for each of these 5 neighborhoods.
library(ggplot2)
# Basic violin plot
p <- ggplot(new_data, aes(x=Neighborhood, y=Price)) +
geom_violin(trim=FALSE, fill = "#FF5A5F") + ylim(0, 850) + ggtitle("Distribution of Top 5 Neighborhoods") +
theme(axis.text.x = element_text(angle=15, hjust=1)) + geom_boxplot(width=0.1)
# Display the plot (x-axis labels are angled for readability)
p
Now we have graphed the distribution of prices across the top 5 neighborhoods and can observe some trends about Airbnb prices in Washington D.C. Columbia Heights and Dupont Circle have very similar distributions. In all 5 neighborhoods, most listings are under 200 dollars. Capitol Hill and Union Station show the widest spread of prices. Listings in Columbia Heights and Edgewood mostly fall near or below 100 dollars.
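To back these visual reads with numbers, we can compute per-neighborhood summary statistics (a sketch):
# Median and spread of nightly price within each top-5 neighborhood
new_data %>%
  group_by(Neighborhood) %>%
  summarize(Median_Price = median(Price), SD_Price = sd(Price))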
From our data, we can examine which factors have the greatest effect on the price of a listing. We will now use regression analysis to check whether room type and number of reviews affect price, and to what extent. Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.
summary(listings_data)
## Host_ID Neighborhood Latitude Longitude
## Min. : 1585 Length:9342 Min. :38.82 Min. :-77.11
## 1st Qu.: 11225625 Class :character 1st Qu.:38.90 1st Qu.:-77.04
## Median : 37338913 Mode :character Median :38.91 Median :-77.02
## Mean : 67775752 Mean :38.91 Mean :-77.02
## 3rd Qu.:102213703 3rd Qu.:38.92 3rd Qu.:-77.00
## Max. :344075817 Max. :38.99 Max. :-76.91
##
## Room_Type Price Minimum_Nights Number_of_Reviews
## Length:9342 Min. : 0 Min. : 1.000 Min. : 0.00
## Class :character 1st Qu.: 75 1st Qu.: 1.000 1st Qu.: 1.00
## Mode :character Median : 110 Median : 2.000 Median : 9.00
## Mean : 204 Mean : 7.912 Mean : 39.71
## 3rd Qu.: 175 3rd Qu.: 3.000 3rd Qu.: 49.00
## Max. :10000 Max. :600.000 Max. :830.00
##
## Reviews_per_Month Host_Listings_Count Yearly_Availability Room_Type_ID
## Min. : 0.010 Min. : 1.00 Min. : 0.0 Min. :1.000
## 1st Qu.: 0.280 1st Qu.: 1.00 1st Qu.: 0.0 1st Qu.:1.000
## Median : 1.020 Median : 1.00 Median : 89.0 Median :1.000
## Mean : 1.748 Mean : 18.18 Mean :132.6 Mean :1.628
## 3rd Qu.: 2.710 3rd Qu.: 4.00 3rd Qu.:257.0 3rd Qu.:3.000
## Max. :12.140 Max. :272.00 Max. :365.0 Max. :4.000
## NA's :2175
# Binarize price: 0 if the nightly price is below $100, 1 otherwise. This gives
# us a binary outcome we can use in the regression analysis below.
price_data <- ifelse(listings_data$Price < 100 , 0, 1)
listings_data$Room_Type_ID <- factor(listings_data$Room_Type_ID)
# Fit a logistic regression (binomial glm) of the binarized price on room type
mylogit <- glm(price_data ~ Room_Type_ID, data = listings_data, family = "binomial")
broom::tidy(mylogit)%>%
knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.9855905 | 0.0277676 | 35.494326 | 0.0000000 |
Room_Type_ID2 | -0.8985791 | 0.2964663 | -3.030966 | 0.0024377 |
Room_Type_ID3 | -2.1726959 | 0.0556173 | -39.065129 | 0.0000000 |
Room_Type_ID4 | -3.0551146 | 0.1743784 | -17.520033 | 0.0000000 |
summary(mylogit)
##
## Call:
## glm(formula = price_data ~ Room_Type_ID, family = "binomial",
## data = listings_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6141 -0.7298 0.7964 0.7964 2.0921
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.98559 0.02777 35.494 < 2e-16 ***
## Room_Type_ID2 -0.89858 0.29647 -3.031 0.00244 **
## Room_Type_ID3 -2.17270 0.05562 -39.065 < 2e-16 ***
## Room_Type_ID4 -3.05511 0.17438 -17.520 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12725 on 9341 degrees of freedom
## Residual deviance: 10584 on 9338 degrees of freedom
## AIC: 10592
##
## Number of Fisher Scoring iterations: 4
We can use the confint function to obtain confidence intervals for the coefficient estimates.
confint(mylogit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.9313892 1.0402432
## Room_Type_ID2 -1.4812271 -0.3110788
## Room_Type_ID3 -2.2823696 -2.0643255
## Room_Type_ID4 -3.4122515 -2.7267776
confint.default(mylogit)
## 2.5 % 97.5 %
## (Intercept) 0.9311671 1.0400139
## Room_Type_ID2 -1.4796423 -0.3175159
## Room_Type_ID3 -2.2817037 -2.0636880
## Room_Type_ID4 -3.3968899 -2.7133393
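Because logistic regression coefficients are on the log-odds scale, exponentiating them and their confidence bounds yields odds ratios, which are often easier to interpret (a standard transformation, shown here as a sketch):
# Odds ratios with profiled 95% confidence intervals
exp(cbind(Odds_Ratio = coef(mylogit), confint(mylogit)))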
We now regress the price indicator on room type and number of reviews. Note that this time we fit an ordinary least squares model with lm (a linear probability model) rather than a logistic regression.
mylogit2 <- lm(price_data ~ Room_Type_ID + Number_of_Reviews , data=listings_data)
broom::tidy(mylogit2) %>%
knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.7524913 | 0.0060324 | 124.741728 | 0.0000000 |
Room_Type_ID2 | -0.1936081 | 0.0642236 | -3.014594 | 0.0025801 |
Room_Type_ID3 | -0.4977991 | 0.0103546 | -48.075266 | 0.0000000 |
Room_Type_ID4 | -0.6317646 | 0.0242366 | -26.066536 | 0.0000000 |
Number_of_Reviews | -0.0005768 | 0.0000657 | -8.775158 | 0.0000000 |
summary(mylogit2)
##
## Call:
## lm(formula = price_data ~ Room_Type_ID + Number_of_Reviews, data = listings_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7525 -0.2535 0.2475 0.2619 0.9933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.525e-01 6.032e-03 124.742 < 2e-16 ***
## Room_Type_ID2 -1.936e-01 6.422e-02 -3.015 0.00258 **
## Room_Type_ID3 -4.978e-01 1.035e-02 -48.075 < 2e-16 ***
## Room_Type_ID4 -6.318e-01 2.424e-02 -26.067 < 2e-16 ***
## Number_of_Reviews -5.768e-04 6.574e-05 -8.775 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.434 on 9337 degrees of freedom
## Multiple R-squared: 0.2286, Adjusted R-squared: 0.2282
## F-statistic: 691.6 on 4 and 9337 DF, p-value: < 2.2e-16
In order to reject the null hypothesis, the p-value for each individual category must be below our alpha level of 0.05. The null hypothesis here is that a given factor has no effect on the price of the rental; rejecting it means the factor does have a statistically significant effect. As you can see from the tables, the p-values are consistently below 0.05, both with and without the number of reviews, so we can reject the null hypothesis for those factors. Our hypothesis is that choosing an Airbnb based on reviews is more useful than checking only its price and room type. Now we will compare fitted-versus-residual plots to see how well each model fits, which tells us how far our models are from the actual data points.
# Model 1 (price ~ room type): residuals sit farther from 0
augmented_data1 <- mylogit %>%
augment() %>%
select(-.se.fit, -.hat, -.sigma, -.cooksd, -.std.resid)
augmented_data1 %>%
ggplot(aes(x=.fitted, y=.resid)) +
geom_violin(fill = "#FF5A5F")+
ggtitle("Analysis of price and room type")
# Model 2 (price ~ room type + number of reviews): residuals sit closer to 0
augmented_data2 <- mylogit2 %>%
augment() %>%
select(-.se.fit, -.hat, -.sigma, -.cooksd, -.std.resid)
augmented_data2 %>%
ggplot(aes(x=.fitted, y=.resid)) +
geom_violin(fill = "#00A699")+
ggtitle("Analysis of price, room type and number of reviews")
augmented_data1
## # A tibble: 9,342 x 4
## price_data Room_Type_ID .fitted .resid
## <dbl> <fct> <dbl> <dbl>
## 1 1 1 0.986 0.796
## 2 1 1 0.986 0.796
## 3 0 3 -1.19 -0.730
## 4 0 3 -1.19 -0.730
## 5 0 3 -1.19 -0.730
## 6 1 1 0.986 0.796
## 7 0 3 -1.19 -0.730
## 8 1 3 -1.19 1.70
## 9 0 3 -1.19 -0.730
## 10 0 3 -1.19 -0.730
## # ... with 9,332 more rows
augmented_data2
## # A tibble: 9,342 x 5
## price_data Room_Type_ID Number_of_Reviews .fitted .resid
## <dbl> <fct> <dbl> <dbl> <dbl>
## 1 1 1 178 0.650 0.350
## 2 1 1 41 0.729 0.271
## 3 0 3 79 0.209 -0.209
## 4 0 3 72 0.213 -0.213
## 5 0 3 1 0.254 -0.254
## 6 1 1 149 0.667 0.333
## 7 0 3 45 0.229 -0.229
## 8 1 3 120 0.185 0.815
## 9 0 3 102 0.196 -0.196
## 10 0 3 31 0.237 -0.237
## # ... with 9,332 more rows
Above, you can see that for price and room type (mylogit), the residuals sit farther from 0, while for price, room type, and number of reviews (mylogit2), the residuals sit closer to 0. With the high frequency of residuals near zero, we believe our hypothesis holds: even though room type has an effect on price, the number of reviews is also correlated with price. Additionally, the axis of the mylogit2 violin plot is finer, focused on values between 0 and 1; this is because mylogit's fitted values are on the log-odds scale, while mylogit2's are probabilities.
Yay! You have completed this tutorial on Airbnbs in Washington D.C. and the factors that affect their prices. Now when you visit D.C., you will know the important factors, like number of reviews, room types, and neighborhoods, to gauge before picking your Airbnb. In this tutorial we answered questions like "Where are the best neighborhoods to stay in while in D.C.?" and "How does the price change based on room type and reviews?". You can now ask your own questions about data and find answers to them. It is important to begin your journey by asking motivating questions and questioning the accuracy of the dataset used to answer those questions.
Some of the motivating questions we had in this tutorial were:
Who are the top 10 hosts in D.C.? We chose to run our tutorial on host IDs instead of names to respect the privacy of the hosts. You can choose what values you would like to add or omit accordingly.
How many listings do the top 10 hosts in D.C. have?
What are the different room types, and how many listings does each have?
What are the prices of the top 30 listings in D.C. and where are they located? We used the map to display these locations effectively.
Based on the prices and listings, what are the best neighborhoods to live in?
We then performed a regression analysis on price and room type, and examined the effect of the number of reviews on those factors. We hope you enjoyed our tutorial as much as we enjoyed creating it. We encourage you to continue exploring the world of Data Science and to use the power of data to make informed decisions.
For further information on data science and machine learning, you can visit:
https://mlr.mlr-org.com/ for your machine learning needs
http://101.datascience.community/ for short posts to advance your learning a little every day
https://www.oreilly.com/data/newsletter.html for data science and business focused information
http://www.datascienceassn.org/ to find your own data science community and even possibly get a certification
https://nips.cc/ to meet minds like yours and share your notes on information processing in different areas
Good Luck and Stay Safe!