lolo
lolo

Reputation: 646

R CUT Data Table

I have the following problem when doing a cut on a data.table. I don't know why it doesn't do the filter by "prod" correctly. If I run the cut on the outside, as seen below, it does the cut correctly but not inside the data table. Do you know why and how I can fix it? Thank you

library(data.table)
    db<-data.frame(count=c(331948, 334999, 321000, 305000, 324100, 310000, 305000, 325000, 305000, 329999, 315000,531948, 534999, 521000, 505000, 524100, 510000, 505000, 525000, 505000, 529999, 515000), prod=c("a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b"))

head(db)

    count prod
1  331948    a
2  334999    a
3  321000    a
4  305000    a
5  324100    a
6  310000    a

setDT(db)[ , id := cut(count,8,digits=1,dig.lab = 7), by = prod]

    count prod                  id
1: 331948    a   (531249.1,535029]
2: 334999    a   (531249.1,535029]
3: 321000    a (519999.5,523749.4]
4: 305000    a   (504970,508749.9]
5: 324100    a (523749.4,527499.2]
6: 310000    a (508749.9,512499.8]

table(db[db$prod=='a',]$id)

 (504970,508749.9] (508749.9,512499.8] (512499.8,516249.6] (516249.6,519999.5] (519999.5,523749.4] (523749.4,527499.2] (527499.2,531249.1]   (531249.1,535029] 
                  3                   1                   1                   0                   1                   2                   1                   2 

table(cut(db[db$prod=='a',]$count,8,digits=1,dig.lab = 7))

(304970,308749.9] (308749.9,312499.8] (312499.8,316249.6] (316249.6,319999.5] (319999.5,323749.4] (323749.4,327499.2] (327499.2,331249.1]   (331249.1,335029] 
                  3                   1                   1                   0                   1                   2                   1                   2 

Upvotes: 2

Views: 1629

Answers (1)

MKR
MKR

Reputation: 20085

You can try to first convert group-wise output of cut to character and add a column. Now in the next statement convert new column (containing character) to factor without grouping on anything. In that way you will be able avoid groupwise factors coercing to same values.

You can try as:

library(data.table)

# Data
db<-data.table(count=c(331948, 334999, 321000, 305000, 324100, 310000, 305000, 
325000, 305000, 329999, 315000,531948, 534999, 521000, 505000, 524100,
510000, 505000, 525000, 505000, 529999, 515000),
prod=c("a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b",
      "b","b","b","b","b","b"))

# Try this
db[, id := as.character(cut(count, 8, digits=1, dig.lab = 7)), by=.(prod)][
    ,id:=as.factor(id)]

# Result 
db
      count prod                  id
# 1: 331948    a   (331249.1,335029]
# 2: 334999    a   (331249.1,335029]
# 3: 321000    a (319999.5,323749.4]
# 4: 305000    a   (304970,308749.9]
# 5: 324100    a (323749.4,327499.2]
# 6: 310000    a (308749.9,312499.8]
# 7: 305000    a   (304970,308749.9]
# 8: 325000    a (323749.4,327499.2]
# 9: 305000    a   (304970,308749.9]
# 10: 329999    a (327499.2,331249.1]
# 11: 315000    a (312499.8,316249.6]
# 12: 531948    b   (531249.1,535029]
# 13: 534999    b   (531249.1,535029] 
# so on 

Upvotes: 2

Related Questions