Reputation: 646
I have the following problem when doing a cut on a data.table. I don't know why it doesn't do the filter by "prod" correctly. If I run the cut on the outside, as seen below, it does the cut correctly but not inside the data table. Do you know why and how I can fix it? Thank you
library(data.table)
db<-data.frame(count=c(331948, 334999, 321000, 305000, 324100, 310000, 305000, 325000, 305000, 329999, 315000,531948, 534999, 521000, 505000, 524100, 510000, 505000, 525000, 505000, 529999, 515000), prod=c("a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b"))
head(db)
count prod
1 331948 a
2 334999 a
3 321000 a
4 305000 a
5 324100 a
6 310000 a
setDT(db)[ , id := cut(count,8,digits=1,dig.lab = 7), by = prod]
count prod id
1: 331948 a (531249.1,535029]
2: 334999 a (531249.1,535029]
3: 321000 a (519999.5,523749.4]
4: 305000 a (504970,508749.9]
5: 324100 a (523749.4,527499.2]
6: 310000 a (508749.9,512499.8]
table(db[db$prod=='a',]$id)
(504970,508749.9] (508749.9,512499.8] (512499.8,516249.6] (516249.6,519999.5] (519999.5,523749.4] (523749.4,527499.2] (527499.2,531249.1] (531249.1,535029]
3 1 1 0 1 2 1 2
table(cut(db[db$prod=='a',]$count,8,digits=1,dig.lab = 7))
(304970,308749.9] (308749.9,312499.8] (312499.8,316249.6] (316249.6,319999.5] (319999.5,323749.4] (323749.4,327499.2] (327499.2,331249.1] (331249.1,335029]
3 1 1 0 1 2 1 2
Upvotes: 2
Views: 1629
Reputation: 20085
You can try to first convert group-wise output of cut
to character
and add a column. Now in the next statement convert new column (containing character) to factor
without grouping on anything. In that way you will be able avoid groupwise factors coercing to same values.
You can try as:
library(data.table)
# Data
db<-data.table(count=c(331948, 334999, 321000, 305000, 324100, 310000, 305000,
325000, 305000, 329999, 315000,531948, 534999, 521000, 505000, 524100,
510000, 505000, 525000, 505000, 529999, 515000),
prod=c("a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b",
"b","b","b","b","b","b"))
# Try this
db[, id := as.character(cut(count, 8, digits=1, dig.lab = 7)), by=.(prod)][
,id:=as.factor(id)]
# Result
db
count prod id
# 1: 331948 a (331249.1,335029]
# 2: 334999 a (331249.1,335029]
# 3: 321000 a (319999.5,323749.4]
# 4: 305000 a (304970,308749.9]
# 5: 324100 a (323749.4,327499.2]
# 6: 310000 a (308749.9,312499.8]
# 7: 305000 a (304970,308749.9]
# 8: 325000 a (323749.4,327499.2]
# 9: 305000 a (304970,308749.9]
# 10: 329999 a (327499.2,331249.1]
# 11: 315000 a (312499.8,316249.6]
# 12: 531948 b (531249.1,535029]
# 13: 534999 b (531249.1,535029]
# so on
Upvotes: 2