Rob Hanssen
Rob Hanssen

Reputation: 193

R: Home sales in the last year before each sale

As a follow-up question to a previous one in the same project:

I found that real estate is often measured in inventory time, which is defined as (number of active listings) / (number of homes sale per month, as average over the last 12 months). The best way I could find to count the number of homes sold in the last 12 months before each home sale is through a for-loop.

homesales$yearlysales = 0

for (i in 1:nrow(homesales))
{
        sdt = as.Date(homesales$saledate[i])
        x <- homesales %>% filter( sdt - saledate >= 0 & sdt - saledate < 365) %>% summarise(count=n())
        homesales$yearlysales[i] =x$count[1]
}

homesales$inventorytime = homesales$inventory / homesales$yearlysales * 12
homesales$inventorytime[is.na(homesales$saledate)] = NA
homesales$inventorytime[homesales$yearlysales==0] = NA

Obviously (?), the R language has some prejudice against using a for-loop for doing this type of selections. Is there a better way?

Appendix 1. data table structure

address, listingdate, saledate
101 Street, 2017/01/01, 2017/06/06
106 Street, 2017/03/01, 2017/08/11
102 Street, 2017/05/04, 2017/06/13
109 Street, 2017/07/04, 2017/11/24
...

Appendix 2. The output I'm looking for is something like this.

Upvotes: 0

Views: 36

Answers (1)

Michael Dewar
Michael Dewar

Reputation: 3318

The following gives you the number of active listings on any given day:

library(tidyverse)
library(lubridate)

tmp <- tempfile()
download.file("https://raw.githubusercontent.com/robhanssen/glenlake-homesales/master/homesalesdata-source.csv", tmp)

data <- read_csv(tmp) %>% 
  select(ends_with("date")) %>% 
  mutate(across(everything(), mdy)) %>% 
  pivot_longer(cols = everything(), names_to = "activity", values_to ="date", names_pattern = "(.*)date")

active <- data %>% 
  mutate(active = if_else(activity == "listing", 1, -1)) %>% 
  arrange(date) %>% 
  mutate(active = cumsum(active)) %>% 
  group_by(date) %>% 
  filter(row_number() == n()) %>% 
  select(-activity)

tibble(date = seq(min(data$date, na.rm = TRUE), max(data$date, na.rm = TRUE), by = "days")) %>% 
  left_join(active) %>% 
  fill(active)

Basically, we pivot longer and split each row of data into two rows indicating distinct activities: adding a listing or removing a listing. Then the cumulative sum of this gives you the number of active listings.

Note, this assumes that you are not missing any data. Depending on the specification from which the csv was made, you could be missing activity at the start or end. But this is a warning about the csv itself.

Active listings is a fact about an instant in time. Sales is a fact about a time period. You probably want to aggregate sales by month, and then use the number of active listings from the last day of the month, or perhaps the average number of listings over that month.

Upvotes: 0

Related Questions