Reputation: 189
I have a data set of real estate data. I'm trying to create a new column of days on market groups (labeled DOM_Groups) and group them into 15-day intervals (i.e. 0-14, 15-29, etc.). Then I'm trying to summarize()
these groupings by the count of observations and the average sale price for each 15-day group.
I'm using the cut()
function attempting to break my DOM_Groups into these 15-day intervals. In the base spreadsheet that I imported, the column containing the days on market has a unique observation in each cell, and the data in that column are numeric whole numbers...no decimals, no negative numbers.
When I run the following code, the tibble output is not grouping correctly, and it is including a negative number with a decimal, which does not exist in my data set. I'm not sure what to do to correct this.
gibbsMkt %>%
mutate(DOM_Groups = cut(DOM, breaks = 15, dig.lab = 2)) %>%
filter(Status == "SOLD") %>%
group_by(DOM_Groups) %>%
summarize(numDOM = n(),
avgSP = mean(`Sold Price`, na.rm = TRUE))
The tibble output I get is this:
DOM_Groups numDOM avgSP
<fct> <int> <dbl>
1 (-0.23,16] 74 561675.
2 (16,31] 18 632241.
3 (31,47] 11 561727.
4 (47,63] 8 545862.
5 (63,78] 7 729286.
6 (78,94] 6 624167.
7 (1.4e+02,1.6e+02] 2 541000
8 (1.6e+02,1.7e+02] 1 535395
Also, for rows 7 & 8 in the tibble, the largest number is 164, so I also don't understand why these rows are being converted to scientific notation.
When I use an Excel pivot table, I get the output that I want to reproduce in R, which is depicted below:
How can I reproduce this in R with the correct code?
Upvotes: 0
Views: 1151
Reputation: 2262
Here's a simple solution with my santoku
package:
library(santoku)
gibbsMkt %>%
mutate(DOM_Groups = chop_width(DOM, 15, labels = lbl_dash("-")))
# then proceed as before
You can use the start
argument to chop_width
if you want to start the intervals at a particular number.
Upvotes: 1
Reputation: 20811
cut(x, breaks = 15)
means x
will be cut into 15 intervals--it cannot guess that you want 15-unit intervals starting with 0 and ending with 150. This is in the docs for ?cut
:
breaks
either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You will need to define your own start and end to each interval such as:
seq(0, max(x), 15)
# [1] 0 15 30 45 60 75 90 105 120 135 150
cut(x, seq(0, max(x), 15))
However, if you set it up correctly, you can define your intervals and make labels at the same time.
set.seed(1)
x <- floor(runif(500, 0, 164))
from <- seq(0, max(x), 15)
to <- from + 15 - 1
labs <- sprintf('%s-%s', from, to)
# [1] "0-14" "15-29" "30-44" "45-59" "60-74" "75-89" "90-104" "105-119" "120-134" "135-149" "150-164"
data.frame(table(cut(x, c(from, Inf), right = FALSE)), labels = labs)
# Var1 Freq labels
# 1 [0,15) 35 0-14
# 2 [15,30) 57 15-29
# 3 [30,45) 45 30-44
# 4 [45,60) 44 45-59
# 5 [60,75) 57 60-74
# 6 [75,90) 55 75-89
# 7 [90,105) 33 90-104
# 8 [105,120) 47 105-119
# 9 [120,135) 40 120-134
# 10 [135,150) 39 135-149
# 11 [150,Inf) 48 150-164
DOM_Groups <- cut(x, c(from, Inf), labs, right = FALSE)
data.frame(table(DOM_Groups))
# DOM_Groups Freq
# 1 0-14 35
# 2 15-29 57
# 3 30-44 45
# 4 45-59 44
# 5 60-74 57
# 6 75-89 55
# 7 90-104 33
# 8 105-119 47
# 9 120-134 40
# 10 135-149 39
# 11 150-164 48
Your other question of "why am I getting negative numbers," as I mentioned this does not mean that you have negatives in your data--these are just labels generated by using breaks = 15
with your data.
These are the relevant lines in cut.default
if (length(breaks) == 1L) {
if (is.na(breaks) || breaks < 2L)
stop("invalid number of intervals")
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
dx <- if (rx[1L] != 0)
abs(rx[1L])
else 1
breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
}
else {
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] +
dx/1000)
}
Using the x
from before and breaks = 15
, you can see how negatives are introduced:
breaks <- 15
nb <- as.integer(breaks + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0) {
dx <- if (rx[1L] != 0)
abs(rx[1L])
else 1
breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
} else {
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)
}
breaks
# [1] -0.16300 10.86667 21.73333 32.60000 43.46667 54.33333 65.20000 76.06667 86.93333 97.80000 108.66667 119.53333 130.40000
# [14] 141.26667 152.13333 163.16300
levels(cut(x, breaks = 15))
# [1] "(-0.163,10.9]" "(10.9,21.7]" "(21.7,32.6]" "(32.6,43.5]" "(43.5,54.3]" "(54.3,65.2]" "(65.2,76.1]" "(76.1,86.9]"
# [9] "(86.9,97.8]" "(97.8,109]" "(109,120]" "(120,130]" "(130,141]" "(141,152]" "(152,163]"
Upvotes: 3