ztl

Reputation: 2592

Wrong behaviour of labels from `cut` using `dplyr::rowwise`?

In R, using dplyr, I need to cut the values in one column using breaks that are not constant: for each row, they are defined by values in other columns of the same data frame. Hence I use rowwise. While the breaks themselves seem to work (i.e. they are updated for each row), the labels do not appear to be consistent.

For example:

library(dplyr)
set.seed(10)
myDF = data.frame(a=runif(5, min=0.3, max=0.7), 
                  bmin = rep(0, 5), 
                  bmid = c(0.5, 0.3, 0.6, 0.7, 0.4),
                  bmax = rep(1, 5))

myDF %>% rowwise() %>% mutate(grp1 = cut(a, breaks=c(bmin, bmid, bmax)),
                              grp2 = cut(a, breaks=c(bmin, bmid, bmax), 
                                         labels=c(paste(bmin, bmid, sep='-'),
                                                  paste(bmid, bmax, sep='-'))),
                              grp3 = cut(a, breaks=c(bmin, bmid, bmax), 
                                         labels=c(1, 2)))

#           a  bmin  bmid  bmax    grp1   grp2   grp3
#       (dbl) (dbl) (dbl) (dbl)  (fctr) (fctr) (fctr)
# 1 0.3901746     0   0.5     1 (0,0.5]  0-0.5      1
# 2 0.4098122     0   0.3     1 (0.5,1]  0.5-1      2
# 3 0.4089220     0   0.6     1 (0,0.5]  0-0.5      1
# 4 0.5463317     0   0.7     1 (0,0.5]  0-0.5      1
# 5 0.4718686     0   0.4     1 (0.5,1]  0.5-1      2

In this example, one can see, e.g. in row 2, that the cut is functionally correct (i.e. the value 0.3 was properly used as the cut point bmid instead of 0.5 from the first row), but the resulting label is wrong (i.e. (0.5,1] does not actually contain the value 0.4098122, and 0.5 was not the cut point).

grp2 is an attempt to set the labels manually, which fails too, meaning that a manual, breaks-independent solution as in grp3 appears to be the only way forward...

In short, rowwise seems to apply to the cut points, but not to the labels...

Am I missing anything, or is this a wrong behaviour? How can I label my intervals on a rowwise basis?

Upvotes: 3

Views: 330

Answers (1)

MrFlick

Reputation: 206167

The problem is that you are trying to build a factor column where each row has different levels/labels. This is not really possible with factors. The mutate seems to be trying to harmonize all the factor labels for you, which produces this odd effect. It's not unique to cut(); see also

data.frame(z=c("a","b","c")) %>% rowwise() %>% mutate(g=factor(z))
#        z      g
#   (fctr) (fctr)
# 1      a      a
# 2      b      a
# 3      c      a
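One way to picture what may be going on is sketched below in base R. This is only an illustration of the suspected mechanism, not dplyr's actual internals: a factor stores integer codes plus a set of levels, so if the per-row codes are kept but relabelled with the first row's levels, row 2 ends up with a label it never had. The values of a are rounded from the question's output, and the names f1, f2, codes and relabelled are made up for this sketch.

```r
# Each row yields its own one-element factor with its own levels.
f1 <- cut(0.39, breaks = c(0, 0.5, 1))  # levels "(0,0.5]" "(0.5,1]", code 1
f2 <- cut(0.41, breaks = c(0, 0.3, 1))  # levels "(0,0.3]" "(0.3,1]", code 2

# Keep only the integer codes of both rows...
codes <- c(as.integer(f1), as.integer(f2))

# ...and relabel them with the levels of the FIRST row only
# (assumed behaviour, chosen to mimic the output in the question).
relabelled <- factor(levels(f1)[codes], levels = levels(f1))

print(levels(f2))   # the levels row 2 really had
print(relabelled)   # row 2 now carries a label taken from row 1's breaks
```

The interval chosen for row 2 (the second one) is right, but the text attached to it comes from the first row's breaks, which matches the symptom in the question.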

One workaround would be to return character values rather than factor values.

myDF %>% rowwise() %>% mutate(grp1 = as.character(cut(a, breaks=c(bmin, bmid, bmax))))
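The same idea can be extended to the manual labels from the question. The sketch below is hedged: it uses fixed values of a (rounded from the question's output) instead of runif, and the column grp2_vec is a hypothetical vectorised alternative that assumes bmin < a <= bmax on every row and cut's default right-closed intervals.

```r
library(dplyr)

# Same layout as the question's myDF, with a rounded to two decimals.
myDF <- data.frame(a    = c(0.39, 0.41, 0.41, 0.55, 0.47),
                   bmin = rep(0, 5),
                   bmid = c(0.5, 0.3, 0.6, 0.7, 0.4),
                   bmax = rep(1, 5))

res <- myDF %>%
  rowwise() %>%
  # Convert each one-row factor to character before dplyr combines the rows.
  mutate(grp1 = as.character(cut(a, breaks = c(bmin, bmid, bmax))),
         grp2 = as.character(cut(a, breaks = c(bmin, bmid, bmax),
                                 labels = c(paste(bmin, bmid, sep = "-"),
                                            paste(bmid, bmax, sep = "-"))))) %>%
  ungroup() %>%
  # Vectorised alternative without rowwise(); assumes bmin < a <= bmax.
  mutate(grp2_vec = ifelse(a <= bmid,
                           paste(bmin, bmid, sep = "-"),
                           paste(bmid, bmax, sep = "-")))

print(res)
```

Because the result is a character column, every row keeps the label built from its own breaks. If a factor is needed afterwards, it can be created once on the whole column with factor(), at which point the levels are the union of all labels.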

Upvotes: 6
