haggis

Reputation: 417

Changing value in data frame not possible

I created a data frame:

df <- mydata %>%
  mutate(length.class=cut(mydata$count,breaks = c(1, 10, 100, 1000, 10000),include.lowest=TRUE)) %>%
  group_by(length.class) %>%
  summarise(count = n())

This results in df$length.class having values like "(100,1e+03]", while I would prefer "(100,1000]". However, changing it manually doesn't work:

df$length.class[df$length.class == "(100,1e+03]"] <- "(100,1000]"

Warning message:
In `[<-.factor`(`*tmp*`, df$length.class == "(100,1e+03]", value = c(1L,  :
  invalid factor level, NA generated

Why is changing the string not possible, and what is R trying to tell me with this message?

Bonus question: how can I get the original value back or address the changed row (4)? After executing the change command there's now a "NA" instead of "(100,1e+03]".

Upvotes: 0

Views: 63

Answers (3)

aosmith

Reputation: 36076

The dig.lab argument in cut should take care of this.

From the documentation:

integer which is used when labels are not given. It determines the number of digits used in formatting the break numbers.

In your case, you want to show 5 digits so your code would be

mydata %>%
    mutate(length.class = cut(count, breaks = c(1, 10, 100, 1000, 10000), 
                            include.lowest = TRUE, dig.lab = 5))

The levels of the resulting factor look like:

[1] "[1,10]"       "(10,100]"     "(100,1000]"   "(1000,10000]"
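
A quick way to see the effect of dig.lab in isolation is the following sketch. The vector x is made up for illustration (one value per bin) and stands in for mydata$count, which isn't shown in the question:

```r
# Hypothetical counts, one per bin; not the asker's data
x <- c(5, 50, 500, 5000)
brks <- c(1, 10, 100, 1000, 10000)

# Default (dig.lab = 3): larger breaks are formatted in scientific notation
default_lvls <- levels(cut(x, breaks = brks, include.lowest = TRUE))

# dig.lab = 5: enough digits to write 10000 out in full
fixed_lvls <- levels(cut(x, breaks = brks, include.lowest = TRUE, dig.lab = 5))

print(default_lvls)
print(fixed_lvls)
```

Only the labels differ between the two calls; the binning itself is the same.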

Upvotes: 4

alexwhitworth

Reputation: 4907

The warning (below) tells you all you need to know.

Warning message:
In `[<-.factor`(`*tmp*`, df$length.class == "(100,1e+03]", value = c(1L,  :
  invalid factor level, NA generated

df$length.class is a factor. A factor is stored as integer codes, with a mapping from those codes to the levels used for display. Assigning a string that is not already one of the levels is not allowed, so R inserts NA and warns. The appropriate way is to rename the level rather than overwrite the displayed values.

You can do this in one line, which is compact though admittedly not very readable, or split it into two lines:

levels(df$length.class)[ which(levels(df$length.class) == "(100,1e+03]") ] <- "(100,1000]"

lvl_idx <- which(levels(df$length.class) == "(100,1e+03]") 
levels(df$length.class)[lvl_idx] <- "(100,1000]"
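
To make the mechanics concrete, here is a minimal sketch on a stand-in factor f (hypothetical, built by hand with the same labels as in the question). It also covers the bonus question: the overwritten value becomes NA, but the level itself is still there, so the row can be addressed with is.na() and restored. That last step assumes the column had no other NA values beforehand.

```r
# Hypothetical factor with the labels cut() produced in the question
lv <- c("[1,10]", "(10,100]", "(100,1e+03]")
f <- factor(lv, levels = lv)

# A factor is integer codes plus a character vector of levels
codes <- as.integer(f)
print(codes)
print(levels(f))

# Assigning a string that is not a level yields NA (normally with the
# "invalid factor level, NA generated" warning, suppressed here)
g <- f
suppressWarnings(g[g == "(100,1e+03]"] <- "(100,1000]")
print(g)

# The old level still exists, so the NA entry can be restored
g[is.na(g)] <- "(100,1e+03]"

# Renaming the level changes the label for every entry that uses it
levels(g)[levels(g) == "(100,1e+03]"] <- "(100,1000]"
print(g)
```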

Upvotes: 1

r2evans

Reputation: 160407

library(dplyr)
brks <- 10^(0:4)
# ensure one in each bin
mydata <- data.frame(count = brks[-1] - diff(brks)/2)

# create labels to be used in `cut`
lbls <- mapply(paste0, "(", head(brks, n = -1), ",", brks[-1], "]")
# fix the first: include.lowest = TRUE makes it closed on the left
lbls[1] <- paste0("[", brks[1], ",", brks[2], "]")

df <- mydata %>%
  mutate(length.class = cut(count, breaks = brks, labels = lbls,
         include.lowest = TRUE)) %>%
  group_by(length.class) %>% summarise(count = n())
df
# # A tibble: 4 x 2
#   length.class count
#         <fctr> <int>
# 1       [1,10]     1
# 2     (10,100]     1
# 3   (100,1000]     1
# 4 (1000,10000]     1

If you don't want to redo the calculation, you can simply do:

levels(df$length.class) <- lbls

(assuming you defined lbls correctly per the number of levels/bins).

Some notes about the code:

  • you don't need mydata$ within mutate: your code references the value of mydata$count outside of the pipe, which can be different from the current value of the count column of the data.frame in the pipe; it isn't here, but it easily can be, especially with preceding mutate or group_by verbs.
  • minor, but many consider dots in variable names to be more than a style issue: R's S3 dispatch looks up methods by names of the form generic.class, so dotted names can be mistaken for methods or cause unnecessary lookups (that matters more for dotted function names than for variables, but a consistent naming convention helps either way).
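
The first note can be illustrated with a small sketch. The data frame and the column names inside and outside are made up for the demonstration; the point is only that mydata$count reads the original object, not the column as it currently stands in the pipe:

```r
library(dplyr)

# Hypothetical data, not the asker's
mydata <- data.frame(count = c(5, 50, 500))

out <- mydata %>%
  mutate(count = count * 10) %>%       # count is changed inside the pipe
  mutate(inside = count,               # sees the updated column
         outside = mydata$count)       # still sees the original values

print(out)
```

Here inside and outside disagree without any error or warning, which is why the bare column name is the safer choice within mutate.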

Upvotes: 0
