Reputation: 308
When we group a data.table by the result of cut() with ordered_result = TRUE, the output does not list the break labels in increasing order. Instead, the labels seem to appear in the order in which they are first found in the data.table, which is the same behaviour as with ordered_result = FALSE. Why does data.table not care about ordered factors?
> aaa <- c(1,2,3,4,5,2,3,4,5,6,7)
> aaa <- rev(aaa)
> d <- data.table(x = 1:length(aaa), val = aaa)
> # The following statement does not sort the grouped result by the ordered factor levels.
> d[, sum(x), by = cut(aaa, 3, ordered_result = TRUE)]
         cut V1
1:  (5,7.01]  3
2:     (3,5] 22
3: (0.994,3] 41
> # In fact, the behaviour is the same as with ordered_result = FALSE
> d[, sum(x), by = cut(aaa, 3, ordered_result = FALSE)]
         cut V1
1:  (5,7.01]  3
2:     (3,5] 22
3: (0.994,3] 41
Upvotes: 0
Views: 241
Reputation: 7941
The difference that ordering a factor makes is largely limited to how the factor is treated in statistical models (this is alluded to in ?factor, but without much detail).
Grouping with by does not sort the result by the grouping variable, whether or not it is an ordered factor; the groups are returned in the order in which they first appear in the data. To get a sorted result, use the keyby argument instead:
d[, sum(x), keyby = cut(aaa, 3)]
#          cut V1
#1: (0.994,3] 41
#2:     (3,5] 22
#3:  (5,7.01]  3
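As a minimal sketch of the same idea, the snippet below compares by, keyby, and sorting after grouping. The names bin, total, res_by, res_keyby and res_sorted are illustrative and not from the question, and the sketch cuts the val column rather than the global aaa vector (they hold the same values here).

```r
library(data.table)

aaa <- rev(c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7))
d <- data.table(x = seq_along(aaa), val = aaa)

# by: groups are returned in order of first appearance in the data
res_by <- d[, .(total = sum(x)),
            by = .(bin = cut(val, 3, ordered_result = TRUE))]

# keyby: same grouping, but the result is sorted by (and keyed on) bin
res_keyby <- d[, .(total = sum(x)),
               keyby = .(bin = cut(val, 3, ordered_result = TRUE))]

# Alternative: group with by, then sort afterwards.
# Sorting a factor column follows the order of its levels.
res_sorted <- res_by[order(bin)]

print(res_by)
print(res_keyby)
print(res_sorted)
```

Either keyby or the trailing [order(bin)] should give the bins in increasing order; keyby additionally sets a key on the result, which may or may not be wanted.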
In your example, the factor ordering is preserved: the cut column remains an ordered factor in the grouped result. Compare the following:
str(d[, sum(x), by = cut(aaa, 3, ordered_result = TRUE)])
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ cut: Ord.factor w/ 3 levels "(0.994,3]"<"(3,5]"<..: 3 2 1
# $ V1 : int 3 22 41
# - attr(*, ".internal.selfref")=<externalptr>
str(d[, sum(x), by = cut(aaa, 3, ordered_result = FALSE)])
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ cut: Factor w/ 3 levels "(0.994,3]","(3,5]",..: 3 2 1
# $ V1 : int 3 22 41
# - attr(*, ".internal.selfref")=<externalptr>
Note the change in the class of cut from Ord.factor to Factor.
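Outside of modelling, one place where the Ord.factor class visibly matters in base R is comparison. The following is a small sketch using only base functions; f_ord and f_plain are illustrative names, not from the question.

```r
aaa <- rev(c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7))

f_ord   <- cut(aaa, 3, ordered_result = TRUE)
f_plain <- cut(aaa, 3)

is.ordered(f_ord)    # TRUE
is.ordered(f_plain)  # FALSE

# Both factors carry the same levels in the same sequence
identical(levels(f_ord), levels(f_plain))  # TRUE

# Relational operators are defined for ordered factors:
# aaa[1] is 7 (top bin), aaa[11] is 1 (bottom bin)
f_ord[1] > f_ord[11]  # TRUE

# For a plain factor the same comparison yields NA (with a warning)
cmp_plain <- suppressWarnings(f_plain[1] > f_plain[11])
is.na(cmp_plain)  # TRUE
```

Neither class affects the row order that by returns, which is why the two grouped results in the question look identical.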
Upvotes: 3