broccoli
broccoli

Reputation: 4846

R data.table subsetting within a group and splitting a data table into two

I have the following data.table.

ts,id
1,a
2,a
3,a
4,a
5,a
6,a
7,a
1,b
2,b
3,b
4,b

I want to subset this data.table into two. The criteria is to have approximately the first half for each group (in this case column "id") in one data table and the remaining in another data.table. So the expected result are two data.tables as follows

ts,id
1,a
2,a
3,a
4,a
1,b
2,b

and

 ts,id
 5,a
 6,a
 7,a
 3,b
 4,b

I tried the following,

z1 = x[,.SD[.I < .N/2,],by=dev]
z1

and got just the following

id ts
a  1
a  2
a  3

Somehow, .I within the .SD isn't working the way I think it should. Any help appreciated. Thanks in advance.

Upvotes: 4

Views: 1367

Answers (2)

eddi
eddi

Reputation: 49448

Here's the corrected version of your expression:

dt[, .SD[, .SD[.I <= .N/2]], by = id]
#   id ts
#1:  a  1
#2:  a  2
#3:  a  3
#4:  b  1
#5:  b  2

The reason yours is not working is because .I and .N are not available in the i-expression (i.e. first argument of [) and so the parent data.table's .I and .N are used (i.e. dt's).

Upvotes: 2

mnel
mnel

Reputation: 115485

.I gives the row locations with respect to the whole data.table. Thus it can't be used like that within .SD.

Something like

DT[, subset := seq_len(.N) > .N/2,by='id']

subset1 <- DT[(subset)][,subset:=NULL]
subset2 <- DT[!(subset)][,subset:=NULL]

subset1
#    ts id
# 1:  4  a
# 2:  5  a
# 3:  6  a
# 4:  7  a
# 5:  3  b
# 6:  4  b
subset2
#   ts id
# 1:  1  a
# 2:  2  a
# 3:  3  a
# 4:  1  b
# 5:  2  b

Should work

For more than 2 groups, you could use cut to create a factor with the appropriate number of levels

Something like

 DT[, subset := cut(seq_len(.N), 3, labels= FALSE),by='id']
 # you could copy to the global environment a subset for each, but this 
 # will not be memory efficient!
 list2env(setattr(split(DT, DT[['subset']]),'names', paste0('s',1:3)), .GlobalEnv)

Upvotes: 6

Related Questions