Reputation: 417
I created a data frame:
df <- mydata %>%
mutate(length.class=cut(mydata$count,breaks = c(1, 10, 100, 1000, 10000),include.lowest=TRUE)) %>%
group_by(length.class) %>%
summarise(count = n())
This results in df$length.class having values like "(100,1e+03]", while I would prefer "(100,1000]". However, changing it manually doesn't work:
df$length.class[df$length.class == "(100,1e+03]"] <- "(100,1000]"
Warning message:
In `[<-.factor`(`*tmp*`, df$length.class == "(100,1e+03]", value = c(1L, :
invalid factor level, NA generated
Why is changing the string not possible, and what is R trying to tell me with this message?
Bonus question: how can I get the original value back or address the changed row (4)? After executing the change command there is now an NA instead of "(100,1e+03]".
Upvotes: 0
Views: 63
Reputation: 36076
The dig.lab argument of cut should take care of this.
From the documentation:
integer which is used when labels are not given. It determines the number of digits used in formatting the break numbers.
In your case the largest break, 10000, has 5 digits, so your code would be:
mydata %>%
mutate(length.class = cut(count, breaks = c(1, 10, 100, 1000, 10000),
include.lowest = TRUE, dig.lab = 5))
The levels of the resulting factor look like:
[1] "[1,10]" "(10,100]" "(100,1000]" "(1000,10000]"
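To see the effect in isolation, here is a minimal base R sketch. The breaks are the ones from the question; the vector x is made-up sample data with one value per bin, not the asker's mydata.

```r
# Breaks from the question; x is hypothetical sample data, one value per bin.
brks <- c(1, 10, 100, 1000, 10000)
x <- c(5, 50, 500, 5000)

# Default dig.lab (3): breaks with more than 3 digits fall back to scientific notation.
default_cut <- cut(x, breaks = brks, include.lowest = TRUE)

# dig.lab = 5 leaves enough digits for 10000 to be written out in full.
wide_cut <- cut(x, breaks = brks, include.lowest = TRUE, dig.lab = 5)

levels(default_cut)  # contains "(100,1e+03]"
levels(wide_cut)     # contains "(100,1000]"
```

Only the labels differ between the two results. The bins, and therefore the counts, are the same.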
Upvotes: 4
Reputation: 4907
The warning (below) tells you all you need to know.
Warning message:
In `[<-.factor`(`*tmp*`, df$length.class == "(100,1e+03]", value = c(1L, :
invalid factor level, NA generated
df$length.class is a factor. Factors are stored as integers, with a map between the integers in memory and the levels used for display. You are trying to replace a value with a string that is not one of the existing levels, so R stores NA instead. The appropriate way is to change the levels, not the displayed values.
The first option is compact, though admittedly not very readable:
levels(df$length.class)[ which(levels(df$length.class) == "(100,1e+03]") ] <- "(100,1000]"
Or you could use two lines of code:
lvl_idx <- which(levels(df$length.class) == "(100,1e+03]")
levels(df$length.class)[lvl_idx] <- "(100,1000]"
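The following base R sketch illustrates the integer-codes-plus-levels storage and why the assignment in the question produces NA. The factor f is a small stand-in built from made-up values, not the asker's df. The restore step assumes the NA produced by the failed assignment is the only NA in the vector.

```r
# Hypothetical stand-in for df$length.class: one value per bin.
f <- cut(c(5, 50, 500, 5000), breaks = c(1, 10, 100, 1000, 10000),
         include.lowest = TRUE)

as.integer(f)  # the integer codes stored in memory
levels(f)      # the labels those codes map to

# Assigning a string that is not an existing level yields NA (with a warning).
g <- f
g <- suppressWarnings(replace(g, g == "(100,1e+03]", "(100,1000]"))
is.na(g)       # the third element is now NA; the levels are unchanged

# The old level still exists, so the NA can be set back to it
# (assumption: there were no other NAs before the failed assignment).
g[is.na(g)] <- "(100,1e+03]"

# Renaming the level changes the label and leaves the integer codes alone.
levels(f)[levels(f) == "(100,1e+03]"] <- "(100,1000]"
as.character(f)
```

Because only the level label changes, every element that carried the old label shows the new one, with no per-row replacement needed.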
Upvotes: 1
Reputation: 160407
library(dplyr)
brks <- 10^(0:4)
# ensure one in each bin
mydata <- data.frame(count = brks[-1] - diff(brks)/2)
# create labels to be used in `cut`
lbls <- mapply(paste0, "(", head(brks, n = -1), ",", brks[-1], "]")
# fix the first label: with include.lowest = TRUE the first bin is closed on the left
lbls[1] <- paste0("[", brks[1], ",", brks[2], "]")
df <- mydata %>%
mutate(length.class = cut(count, breaks = brks, labels = lbls,
include.lowest = TRUE)) %>%
group_by(length.class) %>% summarise(count = n())
df
# # A tibble: 4 x 2
# length.class count
# <fctr> <int>
# 1 [1,10] 1
# 2 (10,100] 1
# 3 (100,1000] 1
# 4 (1000,10000] 1
If you don't want to redo the calculation, you can simply do:
levels(df$length.class) <- lbls
(assuming you defined lbls correctly per the number of levels/bins).
Some notes about the code:
mydata$ within mutate: your code references the value of mydata$count outside of the pipe, which can be different from the current value of the count column of the data.frame in the pipe; it isn't here, but it easily can be, especially with preceding mutate or group_by verbs.
Upvotes: 0