Reputation: 101
I am attempting to create ~1000 features for a machine learning problem using data.table. I have two tables that are linked by an id. The first table has a unique id per row, we'll call x. The second table has multiple rows for every id, x. In addition I have three other columns that are factor type variables. I also have a bunch of other numeric columns that I need to use. My goal is to calculate the min, max, mean for each numeric variable grouped by x and the other factor variables then reshape the information such that there is only one unique x id per row by creating a column for each combination of the factor columns along with the associated calculated numeric column. Since I have many numeric columns, I would also like to do this without having to hard code the columns and without a loop, since there are many numeric columns.
As an example I can create a data.table in the basic structure with:
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=8),
y1=c("A","B"), y2=c("C","D", "E"),y3=c("F","G"), v1=sample(1:100,12),
v2=sample(1:100,12), v3=sample(1:100,12))
DT
x y1 y2 y3 v1 v2 v3
1: 1 A C F 12 29 22
2: 1 B D G 62 92 81
3: 1 A E F 60 100 52
4: 1 B C G 61 82 89
5: 1 A D F 83 28 80
6: 1 B E G 97 26 5
7: 1 A C F 1 18 43
8: 1 B D G 22 22 25
9: 2 A E F 99 30 29
10: 2 B C G 47 96 47
11: 2 A D F 63 15 17
12: 2 B E G 49 4 68
13: 2 A C F 12 29 22
14: 2 B D G 62 92 81
15: 2 A E F 60 100 52
16: 2 B C G 61 82 89
17: 3 A D F 83 28 80
18: 3 B E G 97 26 5
19: 3 A C F 1 18 43
20: 3 B D G 22 22 25
21: 3 A E F 99 30 29
22: 3 B C G 47 96 47
23: 3 A D F 63 15 17
24: 3 B E G 49 4 68
x y1 y2 y3 v1 v2 v3
And then create an example grouping using:
interim <- DT[,list(v1min=min(v1),
v1max=max(v1),
v1mean=mean(v1),
v2min=min(v2),
v2max=max(v2),
v2mean=mean(v2),
v3min=min(v3),
v3max=max(v3),
v3mean=mean(v3)),
by=.(x,y1,y2,y3)]
interim
x y1 y2 y3 v1min v1max v1mean v2min v2max v2mean v3min v3max v3mean
1: 1 A C F 1 12 6.5 18 29 23.5 22 43 32.5
2: 1 B D G 22 62 42.0 22 92 57.0 25 81 53.0
3: 1 A E F 60 60 60.0 100 100 100.0 52 52 52.0
4: 1 B C G 61 61 61.0 82 82 82.0 89 89 89.0
5: 1 A D F 83 83 83.0 28 28 28.0 80 80 80.0
6: 1 B E G 97 97 97.0 26 26 26.0 5 5 5.0
7: 2 A E F 60 99 79.5 30 100 65.0 29 52 40.5
8: 2 B C G 47 61 54.0 82 96 89.0 47 89 68.0
9: 2 A D F 63 63 63.0 15 15 15.0 17 17 17.0
10: 2 B E G 49 49 49.0 4 4 4.0 68 68 68.0
11: 2 A C F 12 12 12.0 29 29 29.0 22 22 22.0
12: 2 B D G 62 62 62.0 92 92 92.0 81 81 81.0
13: 3 A D F 63 83 73.0 15 28 21.5 17 80 48.5
14: 3 B E G 49 97 73.0 4 26 15.0 5 68 36.5
15: 3 A C F 1 1 1.0 18 18 18.0 43 43 43.0
16: 3 B D G 22 22 22.0 22 22 22.0 25 25 25.0
17: 3 A E F 99 99 99.0 30 30 30.0 29 29 29.0
18: 3 B C G 47 47 47.0 96 96 96.0 47 47 47.0
The ideal output would then only have 3 rows - one for each unique x (1,2,3) with the following columns (some NA's are expected):
x | A-C-F-v1min | A-C-F-v1max | A-C-F-v1mean | . . . | B-C-G-v3min | B-C-G-v3max | B-C-G-v3mean
Upvotes: 1
Views: 70
Reputation: 49448
This will do it:
dcast(interim, x~y1+y2+y3, value.var = setdiff(names(interim), c('x', 'y1', 'y2', 'y3')))
I can't remember if multiple value.var
's was present in 1.9.6, so you might need to get the latest development version.
Upvotes: 3