Reputation: 1594
I have a data.table with this structure:
Classes ‘data.table’ and 'data.frame': 1336 obs. of 5 variables:
$ timestamp: POSIXct, format: "2013-02-01 00:03:49" "2013-02-01 00:03:49" "2013-02-01 00:07:54" ...
$ hour : int 1 1 1 1 1 1 1 1 1 1 ...
$ price : num 21 22 21 22 21 22 35 35.5 35.9 38 ...
$ qty : num 50 20 50 20 50 20 15 20 3 30 ...
$ timegroup: int 1 250 506 757 758 1004 1253 1 250 506 ...
- attr(*, ".internal.selfref")=<externalptr>
Example data are:
> df
timestamp hour price qty timegroup
1: 2013-02-01 00:03:49 1 21 50 1
2: 2013-02-01 00:03:49 1 22 20 1
3: 2013-02-01 00:07:54 1 21 50 1
4: 2013-02-01 00:07:54 1 22 20 1
5: 2013-02-01 00:11:59 1 21 50 1
---
1332: 2013-04-07 00:12:10 1 40 50 1
1333: 2013-04-07 00:12:10 1 47 50 1
1334: 2013-04-07 00:12:10 1 53 15 1
1335: 2013-04-07 00:12:10 1 78 50 1
1336: 2013-04-07 00:12:10 1 345 25 1
I am trying to clean the data, because there are duplicate entries recorded at different times. For example, rows 3 and 4 should be deleted because they duplicate rows 1 and 2 and were only registered at a different time. I am trying to achieve this by generating groups of timestamps and then comparing consecutive groups with each other. But I got stuck at generating the groups of date-times.
groups <- unique(df$timestamp)
df[,timegroup:=which(timestamp==groups)]
But for some reason the timegroup column is not filled in correctly. I get these warnings, which do not help me much:
Warning messages:
1: In `==.default`(timestamp, groups) :
longer object length is not a multiple of shorter object length
2: In `[.data.table`(df, , `:=`(timegroup, which(timestamp == groups))) :
Supplied 7 items to be assigned to 1336 items of column 'timegroup' (recycled leaving remainder of 6 items).
Also, sapply and a for loop do work.
Can anyone tell me why? It seems to be somehow connected with the format... Thank you.
Upvotes: 1
Views: 1151
Reputation: 49448
The answer to your immediate problem is this:
df[, timegroup := .GRP, by = timestamp]
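As for why the original attempt warns: timestamp == groups compares the two vectors element by element and recycles the shorter one, so it never looks up which group each timestamp belongs to. Here is a minimal sketch on made-up data (the toy table dt and the column name timegroup2 are assumptions for illustration), showing .GRP next to an equivalent match() lookup:

```r
library(data.table)

# Toy data (assumed): three distinct timestamps, two rows each
dt <- data.table(
  timestamp = as.POSIXct(c("2013-02-01 00:03:49", "2013-02-01 00:03:49",
                           "2013-02-01 00:07:54", "2013-02-01 00:07:54",
                           "2013-02-01 00:11:59", "2013-02-01 00:11:59"),
                         tz = "UTC"),
  price = c(21, 22, 21, 22, 21, 22)
)
groups <- unique(dt$timestamp)

# Element-wise comparison with recycling: row k is compared with
# groups[((k - 1) %% 3) + 1], so only rows 1 and 6 happen to match
recycled_hits <- which(dt$timestamp == groups)

# .GRP is the running group counter inside a by= operation
dt[, timegroup := .GRP, by = timestamp]

# match() gives, for each timestamp, its position among the unique values
dt[, timegroup2 := match(timestamp, groups)]

print(dt)
```

Both timegroup and timegroup2 number the groups in order of first appearance. The by = version avoids building the groups vector at all.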
I don't think I understand the general problem you're facing well enough to suggest a solution for it.
My relatively wild guess is that you want this:
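library(data.table)

# Toy data: the group at timestamp 2 repeats the group at timestamp 1
df = data.table(timestamp = c(1,1,2,2,3,3), var1 = c(1,2,1,2,1,3), var2 = c(1,2,1,2,1,4))
groups = unique(df$timestamp)
# TRUE where a group's non-timestamp columns equal those of the previous group
groups.duplicated = c(FALSE, sapply(seq_along(groups)[-1], function(i) {
  identical(df[timestamp == groups[i-1], -1, with = FALSE],
            df[timestamp == groups[i], -1, with = FALSE])
}))
# Keep only the groups that do not repeat their predecessor
df[timestamp %in% groups[!groups.duplicated]]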
# timestamp var1 var2
#1: 1 1 1
#2: 1 2 2
#3: 3 1 1
#4: 3 3 4
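The same consecutive-group comparison can also be written without the sapply loop, by collapsing each group into one signature string and comparing neighbouring signatures. This is only a sketch under assumptions: the toy table dt, the helper table sig and the columns signature and repeated are made up for illustration, and it treats two groups as equal only if their rows appear in the same order.

```r
library(data.table)

# Toy data (assumed), using the question's column names
dt <- data.table(
  timestamp = c(1, 1, 2, 2, 3, 3),
  price     = c(21, 22, 21, 22, 21, 35),
  qty       = c(50, 20, 50, 20, 50, 15)
)

# One signature string per timestamp, built from the non-time columns
sig <- dt[, list(signature = paste(price, qty, collapse = "|")), by = timestamp]

# A group is a repeat when its signature equals the previous group's;
# the first group has no predecessor and is never a repeat
sig[, repeated := c(FALSE, signature[-1] == signature[-.N])]

# Keep only rows whose timestamp group is not a repeat
cleaned <- dt[timestamp %in% sig$timestamp[!sig$repeated]]
print(cleaned)
```

Here the group at timestamp 2 repeats the group at timestamp 1 and is dropped, leaving the rows for timestamps 1 and 3.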
Upvotes: 2