Reputation: 688
I have a data.table
that looks like this:
df
seller_id buyer_id hs10 year mean_ln_c_hit
1: 1 5 1200000000 5 4.407456
2: 1 24 1500000000 2 4.173422
3: 1 8 1500000000 3 4.532695
4: 1 8 1500000000 5 4.830106
5: 1 16 1500000000 5 4.830106
6: 1 17 1800000000 3 3.920435
7: 1 20 1800000000 5 4.357196
8: 1 27 1800000000 5 4.357196
9: 1 1 1900000000 1 4.835762
10: 1 2 1900000000 1 4.835762
11: 1 3 1900000000 1 4.835762
12: 1 4 1900000000 1 4.835762
13: 1 5 1900000000 1 4.835762
14: 1 6 1900000000 1 4.835762
15: 1 7 1900000000 1 4.835762
16: 1 8 1900000000 1 4.835762
17: 1 9 1900000000 1 4.835762
18: 1 11 1900000000 1 4.835762
19: 1 12 1900000000 1 4.835762
20: 1 13 1900000000 1 4.835762
21: 1 14 1900000000 1 4.835762
22: 1 15 1900000000 1 4.835762
23: 1 16 1900000000 1 4.835762
24: 1 17 1900000000 1 4.835762
25: 1 18 1900000000 1 4.835762
26: 1 19 1900000000 1 4.835762
27: 1 20 1900000000 1 4.835762
28: 1 21 1900000000 1 4.835762
29: 1 22 1900000000 1 4.835762
30: 1 23 1900000000 1 4.835762
31: 1 24 1900000000 1 4.835762
32: 1 25 1900000000 1 4.835762
33: 1 26 1900000000 1 4.835762
34: 1 27 1900000000 1 4.835762
35: 1 28 1900000000 1 4.835762
36: 1 29 1900000000 1 4.835762
37: 1 30 1900000000 1 4.835762
38: 1 1 1900000000 2 4.409253
39: 1 2 1900000000 2 4.409253
40: 1 3 1900000000 2 4.409253
41: 1 4 1900000000 2 4.409253
42: 1 5 1900000000 2 4.409253
43: 1 6 1900000000 2 4.409253
44: 1 7 1900000000 2 4.409253
45: 1 8 1900000000 2 4.409253
46: 1 9 1900000000 2 4.409253
47: 1 10 1900000000 2 4.409253
48: 1 11 1900000000 2 4.409253
49: 1 12 1900000000 2 4.409253
50: 1 13 1900000000 2 4.409253
51: 1 14 1900000000 2 4.409253
52: 1 15 1900000000 2 4.409253
53: 1 16 1900000000 2 4.409253
54: 1 17 1900000000 2 4.409253
55: 1 18 1900000000 2 4.409253
56: 1 19 1900000000 2 4.409253
57: 1 20 1900000000 2 4.409253
58: 1 21 1900000000 2 4.409253
59: 1 22 1900000000 2 4.409253
60: 1 23 1900000000 2 4.409253
61: 1 25 1900000000 2 4.409253
62: 1 26 1900000000 2 4.409253
63: 1 27 1900000000 2 4.409253
64: 1 28 1900000000 2 4.409253
65: 1 29 1900000000 2 4.409253
66: 1 30 1900000000 2 4.409253
67: 1 1 1900000000 3 4.514642
68: 1 3 1900000000 3 4.514642
69: 1 4 1900000000 3 4.514642
70: 1 5 1900000000 3 4.514642
71: 1 6 1900000000 3 4.514642
72: 1 7 1900000000 3 4.514642
73: 1 9 1900000000 3 4.514642
74: 1 11 1900000000 3 4.514642
75: 1 12 1900000000 3 4.514642
76: 1 13 1900000000 3 4.514642
77: 1 14 1900000000 3 4.514642
78: 1 15 1900000000 3 4.514642
79: 1 16 1900000000 3 4.514642
80: 1 18 1900000000 3 4.514642
81: 1 19 1900000000 3 4.514642
82: 1 20 1900000000 3 4.514642
83: 1 21 1900000000 3 4.514642
84: 1 22 1900000000 3 4.514642
85: 1 23 1900000000 3 4.514642
86: 1 24 1900000000 3 4.514642
87: 1 25 1900000000 3 4.514642
88: 1 26 1900000000 3 4.514642
89: 1 27 1900000000 3 4.514642
90: 1 28 1900000000 3 4.514642
91: 1 29 1900000000 3 4.514642
92: 1 30 1900000000 3 4.514642
93: 1 2 1900000000 5 4.698335
94: 1 3 1900000000 5 4.698335
95: 1 4 1900000000 5 4.698335
96: 1 6 1900000000 5 4.698335
97: 1 7 1900000000 5 4.698335
98: 1 9 1900000000 5 4.698335
99: 1 11 1900000000 5 4.698335
100: 1 12 1900000000 5 4.698335
I want to use data.table
feature to create a new column called lag_mean_ln_c_hit
which contains the one-year lag lag
of the value in column mean_ln_c_hit
, conditional on the seller_id
and hs10
. My best attempt is:
df[!is.na(year), lag_mean_ln_c_hit:=
(.SD[.(seller_id = seller_id, hs10 = hs10, year = year - 1), mean_ln_c_hit, on = c("seller_id", "hs10", "year"), allow.cartesian = TRUE ]) ]
which kind of correctly pics when to display NAs
(i.e., year = 1
is the first year, so the lag for all observations in the first year
should be NA), but fails to capture the right lags. It always report 4.835762
for any other combination that is not NA
(albeit once it reports 4.173422
).
Each seller_id
has multiple buyer_id
entries with the same mean_ln_c_hit
value. Therefore I have duplicates. Running the code, I get the following Warning as well
Warning message:
In `[.data.table`(df, !is.na(year), `:=`(lag_mean_ln_c_hit, (.SD[.(seller_id = seller_id, :
Supplied 1640 items to be assigned to 100 items of column 'lag_mean_ln_c_hit' (1540 unused)
The code would not run without allow.cartesian = TRUE
.
Any idea on how to solve this issue? It should be pretty simple, but I can't get my head around it.
Upvotes: 0
Views: 49
Reputation: 25225
No clear what is the desired output. If it has to be exactly one year ago, here is an option:
seller <- unique(df, by=c("seller_id", "hs10", "year"))
df[, oneyearago := year - 1L]
df[, lag_mean_ln_c_hit := seller[.SD, on=.(seller_id, hs10, year=oneyearago), x.mean_ln_c_hit]]
output:
seller_id buyer_id hs10 year mean_ln_c_hit oneyearago lag_mean_ln_c_hit
1: 1 5 1200000000 5 4.407456 4 NA
2: 1 24 1500000000 2 4.173422 1 NA
3: 1 8 1500000000 3 4.532695 2 4.173422
4: 1 8 1500000000 5 4.830106 4 NA
5: 1 16 1500000000 5 4.830106 4 NA
6: 1 17 1800000000 3 3.920435 2 NA
7: 1 20 1800000000 5 4.357196 4 NA
8: 1 27 1800000000 5 4.357196 4 NA
9: 1 1 1900000000 1 4.835762 0 NA
10: 1 2 1900000000 1 4.835762 0 NA
11: 1 3 1900000000 1 4.835762 0 NA
12: 1 4 1900000000 1 4.835762 0 NA
13: 1 5 1900000000 1 4.835762 0 NA
14: 1 6 1900000000 1 4.835762 0 NA
15: 1 7 1900000000 1 4.835762 0 NA
16: 1 8 1900000000 1 4.835762 0 NA
17: 1 9 1900000000 1 4.835762 0 NA
18: 1 11 1900000000 1 4.835762 0 NA
19: 1 12 1900000000 1 4.835762 0 NA
20: 1 13 1900000000 1 4.835762 0 NA
21: 1 14 1900000000 1 4.835762 0 NA
22: 1 15 1900000000 1 4.835762 0 NA
23: 1 16 1900000000 1 4.835762 0 NA
24: 1 17 1900000000 1 4.835762 0 NA
25: 1 18 1900000000 1 4.835762 0 NA
26: 1 19 1900000000 1 4.835762 0 NA
27: 1 20 1900000000 1 4.835762 0 NA
28: 1 21 1900000000 1 4.835762 0 NA
29: 1 22 1900000000 1 4.835762 0 NA
30: 1 23 1900000000 1 4.835762 0 NA
31: 1 24 1900000000 1 4.835762 0 NA
32: 1 25 1900000000 1 4.835762 0 NA
33: 1 26 1900000000 1 4.835762 0 NA
34: 1 27 1900000000 1 4.835762 0 NA
35: 1 28 1900000000 1 4.835762 0 NA
36: 1 29 1900000000 1 4.835762 0 NA
37: 1 30 1900000000 1 4.835762 0 NA
38: 1 1 1900000000 2 4.409253 1 4.835762
39: 1 2 1900000000 2 4.409253 1 4.835762
40: 1 3 1900000000 2 4.409253 1 4.835762
41: 1 4 1900000000 2 4.409253 1 4.835762
42: 1 5 1900000000 2 4.409253 1 4.835762
43: 1 6 1900000000 2 4.409253 1 4.835762
44: 1 7 1900000000 2 4.409253 1 4.835762
45: 1 8 1900000000 2 4.409253 1 4.835762
46: 1 9 1900000000 2 4.409253 1 4.835762
47: 1 10 1900000000 2 4.409253 1 4.835762
48: 1 11 1900000000 2 4.409253 1 4.835762
49: 1 12 1900000000 2 4.409253 1 4.835762
50: 1 13 1900000000 2 4.409253 1 4.835762
51: 1 14 1900000000 2 4.409253 1 4.835762
52: 1 15 1900000000 2 4.409253 1 4.835762
53: 1 16 1900000000 2 4.409253 1 4.835762
54: 1 17 1900000000 2 4.409253 1 4.835762
55: 1 18 1900000000 2 4.409253 1 4.835762
56: 1 19 1900000000 2 4.409253 1 4.835762
57: 1 20 1900000000 2 4.409253 1 4.835762
58: 1 21 1900000000 2 4.409253 1 4.835762
59: 1 22 1900000000 2 4.409253 1 4.835762
60: 1 23 1900000000 2 4.409253 1 4.835762
61: 1 25 1900000000 2 4.409253 1 4.835762
62: 1 26 1900000000 2 4.409253 1 4.835762
63: 1 27 1900000000 2 4.409253 1 4.835762
64: 1 28 1900000000 2 4.409253 1 4.835762
65: 1 29 1900000000 2 4.409253 1 4.835762
66: 1 30 1900000000 2 4.409253 1 4.835762
67: 1 1 1900000000 3 4.514642 2 4.409253
68: 1 3 1900000000 3 4.514642 2 4.409253
69: 1 4 1900000000 3 4.514642 2 4.409253
70: 1 5 1900000000 3 4.514642 2 4.409253
71: 1 6 1900000000 3 4.514642 2 4.409253
72: 1 7 1900000000 3 4.514642 2 4.409253
73: 1 9 1900000000 3 4.514642 2 4.409253
74: 1 11 1900000000 3 4.514642 2 4.409253
75: 1 12 1900000000 3 4.514642 2 4.409253
76: 1 13 1900000000 3 4.514642 2 4.409253
77: 1 14 1900000000 3 4.514642 2 4.409253
78: 1 15 1900000000 3 4.514642 2 4.409253
79: 1 16 1900000000 3 4.514642 2 4.409253
80: 1 18 1900000000 3 4.514642 2 4.409253
81: 1 19 1900000000 3 4.514642 2 4.409253
82: 1 20 1900000000 3 4.514642 2 4.409253
83: 1 21 1900000000 3 4.514642 2 4.409253
84: 1 22 1900000000 3 4.514642 2 4.409253
85: 1 23 1900000000 3 4.514642 2 4.409253
86: 1 24 1900000000 3 4.514642 2 4.409253
87: 1 25 1900000000 3 4.514642 2 4.409253
88: 1 26 1900000000 3 4.514642 2 4.409253
89: 1 27 1900000000 3 4.514642 2 4.409253
90: 1 28 1900000000 3 4.514642 2 4.409253
91: 1 29 1900000000 3 4.514642 2 4.409253
92: 1 30 1900000000 3 4.514642 2 4.409253
93: 1 2 1900000000 5 4.698335 4 NA
94: 1 3 1900000000 5 4.698335 4 NA
95: 1 4 1900000000 5 4.698335 4 NA
96: 1 6 1900000000 5 4.698335 4 NA
97: 1 7 1900000000 5 4.698335 4 NA
98: 1 9 1900000000 5 4.698335 4 NA
99: 1 11 1900000000 5 4.698335 4 NA
100: 1 12 1900000000 5 4.698335 4 NA
seller_id buyer_id hs10 year mean_ln_c_hit oneyearago lag_mean_ln_c_hit
Or if it has be at least one year ago, you can use a non-equi join and pick the last match:
seller <- unique(df, by=c("seller_id", "hs10", "year"))
df[, lag_mean_ln_c_hit := seller[.SD, on=.(seller_id, hs10, year<year), x.mean_ln_c_hit, mult="last"]]
data:
library(data.table)
df <- fread("seller_id buyer_id hs10 year mean_ln_c_hit
1 5 1200000000 5 4.407456
1 24 1500000000 2 4.173422
1 8 1500000000 3 4.532695
1 8 1500000000 5 4.830106
1 16 1500000000 5 4.830106
1 17 1800000000 3 3.920435
1 20 1800000000 5 4.357196
1 27 1800000000 5 4.357196
1 1 1900000000 1 4.835762
1 2 1900000000 1 4.835762
1 3 1900000000 1 4.835762
1 4 1900000000 1 4.835762
1 5 1900000000 1 4.835762
1 6 1900000000 1 4.835762
1 7 1900000000 1 4.835762
1 8 1900000000 1 4.835762
1 9 1900000000 1 4.835762
1 11 1900000000 1 4.835762
1 12 1900000000 1 4.835762
1 13 1900000000 1 4.835762
1 14 1900000000 1 4.835762
1 15 1900000000 1 4.835762
1 16 1900000000 1 4.835762
1 17 1900000000 1 4.835762
1 18 1900000000 1 4.835762
1 19 1900000000 1 4.835762
1 20 1900000000 1 4.835762
1 21 1900000000 1 4.835762
1 22 1900000000 1 4.835762
1 23 1900000000 1 4.835762
1 24 1900000000 1 4.835762
1 25 1900000000 1 4.835762
1 26 1900000000 1 4.835762
1 27 1900000000 1 4.835762
1 28 1900000000 1 4.835762
1 29 1900000000 1 4.835762
1 30 1900000000 1 4.835762
1 1 1900000000 2 4.409253
1 2 1900000000 2 4.409253
1 3 1900000000 2 4.409253
1 4 1900000000 2 4.409253
1 5 1900000000 2 4.409253
1 6 1900000000 2 4.409253
1 7 1900000000 2 4.409253
1 8 1900000000 2 4.409253
1 9 1900000000 2 4.409253
1 10 1900000000 2 4.409253
1 11 1900000000 2 4.409253
1 12 1900000000 2 4.409253
1 13 1900000000 2 4.409253
1 14 1900000000 2 4.409253
1 15 1900000000 2 4.409253
1 16 1900000000 2 4.409253
1 17 1900000000 2 4.409253
1 18 1900000000 2 4.409253
1 19 1900000000 2 4.409253
1 20 1900000000 2 4.409253
1 21 1900000000 2 4.409253
1 22 1900000000 2 4.409253
1 23 1900000000 2 4.409253
1 25 1900000000 2 4.409253
1 26 1900000000 2 4.409253
1 27 1900000000 2 4.409253
1 28 1900000000 2 4.409253
1 29 1900000000 2 4.409253
1 30 1900000000 2 4.409253
1 1 1900000000 3 4.514642
1 3 1900000000 3 4.514642
1 4 1900000000 3 4.514642
1 5 1900000000 3 4.514642
1 6 1900000000 3 4.514642
1 7 1900000000 3 4.514642
1 9 1900000000 3 4.514642
1 11 1900000000 3 4.514642
1 12 1900000000 3 4.514642
1 13 1900000000 3 4.514642
1 14 1900000000 3 4.514642
1 15 1900000000 3 4.514642
1 16 1900000000 3 4.514642
1 18 1900000000 3 4.514642
1 19 1900000000 3 4.514642
1 20 1900000000 3 4.514642
1 21 1900000000 3 4.514642
1 22 1900000000 3 4.514642
1 23 1900000000 3 4.514642
1 24 1900000000 3 4.514642
1 25 1900000000 3 4.514642
1 26 1900000000 3 4.514642
1 27 1900000000 3 4.514642
1 28 1900000000 3 4.514642
1 29 1900000000 3 4.514642
1 30 1900000000 3 4.514642
1 2 1900000000 5 4.698335
1 3 1900000000 5 4.698335
1 4 1900000000 5 4.698335
1 6 1900000000 5 4.698335
1 7 1900000000 5 4.698335
1 9 1900000000 5 4.698335
1 11 1900000000 5 4.698335
1 12 1900000000 5 4.698335")
Upvotes: 1