Reputation: 53
I have a data.table that consists of several groups (a hierarchical panel/longitudinal dataset, to be more specific), and one cell within a group looks like this:
z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
t = as.Date(c(27, 32:34, 36:41)))
# that is:
# x t
# 1: 10.0 1970-01-28
# 2: 10.5 1970-02-02
# 3: 11.1 1970-02-03
# 4: 14.0 1970-02-04
# 5: 14.2 1970-02-06 # to be removed since 14.2-14.0 = 0.2 <0.5
# 6: 14.4 1970-02-07 # to be removed since 14.4-14.2 = 0.2 <0.5 and 14.4-14.0 = 0.4 <0.5
# 7: 14.6 1970-02-08 # shall NOT be removed because 14.6-14.0 = 0.6 > 0.5
# 8: 17.0 1970-02-09
# 9: 17.4 1970-02-10 # to be removed since 17.4-17.0 = 0.4 <0.5
# 10: 30.0 1970-02-11
For simplicity, the groups are excluded, so just assume there are only two variables (columns) in the data.
I need to drop observations that differ by less than 0.5 from the last remaining row before them, so the result would look like this:
# x t
# 1: 10.0 1970-01-28
# 2: 10.5 1970-02-02
# 3: 11.1 1970-02-03
# 4: 14.0 1970-02-04
# 7: 14.6 1970-02-08
# 8: 17.0 1970-02-09
# 10: 30.0 1970-02-11
In the end, any two neighboring values differ by no less than 0.5, in the order of the variable t.
Is this possible for a data.table like this, but much larger, with several groups and nearly 100 million observations?
Thank you in advance!
Upvotes: 1
Views: 299
Reputation: 14764
If I understood correctly, you could do:
library(data.table)
z <- z[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][
  , filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][
    x - filt >= 0.5 | x == filt][, filt := NULL]
Explanation:
First, we create groups with cumsum(c(1, +(x >= shift(x) + 0.5)[-1])). Therein, we check for each row whether x >= shift(x) + 0.5 (i.e. whether the difference between x and the previous row is larger than or equal to 0.5). This evaluates to TRUE or FALSE, which we turn into 1 and 0 with the + sign. As the first element will always be NA (there is no previous row), we remove it with [-1] after the expression. Since this leaves the first value missing from the vector, we construct a new one that begins with 1 and is followed by what we computed before. Afterwards we apply cumsum: it assigns a new number each time a row is larger than or equal to the previous one + 0.5; if there is no such row in-between, it keeps assigning the last number (as we inserted 1 at the beginning of the vector, it starts at 1 and increases by 1 every time it encounters a row which satisfies the condition for non-exclusion).
Within each of these groups, filt := min(x) stores the group's reference value; for rows where x equals that reference, filt is then replaced by the previous row's x (filling with the first value). The final subset x - filt >= 0.5 | x == filt therefore keeps the reference rows and any row at least 0.5 above its reference.
Output:
x t
1: 10.0 1970-01-28
2: 10.5 1970-02-02
3: 11.1 1970-02-03
4: 14.0 1970-02-04
5: 14.6 1970-02-08
6: 17.0 1970-02-09
7: 30.0 1970-02-11
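To see how the grouping expression behaves, here is a sketch evaluating its intermediate values on the example data (variable names `step` and `grp` are mine, not part of the answer):

```r
library(data.table)

x <- c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30)

# TRUE/1 where a row is at least 0.5 above the previous row;
# the leading NA from shift() is dropped with [-1]
step <- +(x >= shift(x) + 0.5)[-1]

# prepend 1 for the first row; the cumulative sum yields the run ids
grp <- cumsum(c(1, step))
grp
# [1] 1 2 3 4 4 4 4 5 5 6
```

Rows 5-7 fall into group 4 together with row 4, so min(x) for that group is 14.0 — the reference value that rows 5 and 6 fail to exceed by 0.5.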
Upvotes: 2
Reputation: 2425
As the gap depends on the sequential removal of rows, the solution below uses an iterative approach: identify the first offending row, remove it, and re-calculate the subsequent gaps.
z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
t = as.Date(c(27, 32:34, 36:41)))
setkeyv(z,"t")
find_gaps <- function(dt) {
dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
gaps <- dt[,abs(x-last_x) < 0.5,]
gap <- which(gaps==TRUE)[1]
print(paste0("removing row: ", gap))
return (gap)
}
while(!is.na(gap<-find_gaps(z))) { z <- z[-gap] }
z
Results:
[1] "removing row: 5"
[1] "removing row: 5"
[1] "removing row: 7"
> z
x t last_x
1: 10.0 1970-01-28 NA
2: 10.5 1970-02-02 10.0
3: 11.1 1970-02-03 10.5
4: 14.0 1970-02-04 11.1
5: 14.6 1970-02-08 14.0
6: 17.0 1970-02-09 14.6
7: 30.0 1970-02-11 17.0
Noting the 8 GB file and with an eye for efficiency: proposing a good old for loop as the most efficient.
z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
x <- z1$x
last_x <- x[1]
gaps <- c()
for (i in 2:length(x))
{
if (abs(x[i]-last_x) < 0.5) gaps <- c(gaps,i)
else last_x <- x[i]
}
z1 <- z1[-(gaps)]
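One caveat for the ~100-million-row case: gaps <- c(gaps, i) grows the vector one element at a time, copying it repeatedly, and z1[-(gaps)] errors when no row is removed (gaps stays NULL). A sketch of the same loop with a preallocated logical vector instead (the keep variable is mine):

```r
library(data.table)

z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                 t = as.Date(c(27, 32:34, 36:41)))
setkeyv(z1, "t")

x <- z1$x
keep <- rep(TRUE, length(x))  # preallocated: no repeated copying
last_x <- x[1]
for (i in 2:length(x)) {
  if (abs(x[i] - last_x) < 0.5) {
    keep[i] <- FALSE          # drop the row; the reference value stays
  } else {
    last_x <- x[i]            # the row survives and becomes the new reference
  }
}
z1 <- z1[keep]                # logical subset also works when nothing is dropped
```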
microbenchmark::microbenchmark(times=100,
forway={
z1 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z1,"t")
x <- z1$x; last_x <- x[1]; gaps <- c()
for (i in 2:length(x)) { if (abs(x[i]-last_x) < 0.5) { gaps <- c(gaps,i); } else { last_x <- x[i]; } }
z1 <- z1[-(gaps)]
},
datatableway={
z2 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z2,"t")
z2 <- z2[, filt := min(x), by = cumsum(c(1, +(x >= shift(x) + 0.5)[-1]))][, filt := ifelse(x == filt, shift(x, fill = x[1]), filt)][x - filt >= 0.5 | x == filt, ][, filt := NULL]
},
whileway={
z3 <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30), t = as.Date(c(27, 32:34, 36:41))) ; setkeyv(z3,"t")
find_gaps <- function(dt) {
dt[, last_x := shift(.SD, n=1, fill=NA, type="lag"), .SDcols="x"]
gaps <- dt[,abs(x-last_x) < 0.5,]
which(gaps==TRUE)[1]
}
while(!is.na(gap<-find_gaps(z3))) { z3 <- z3[-gap] }
}
)
(z1==z2) & (z2==z3[,.(x,t)])
Unit: milliseconds
expr min lq mean median uq max neval
forway 2.741609 3.607341 4.067566 4.069382 4.556219 5.61997 100
datatableway 7.552005 8.915333 9.839475 9.606205 10.762764 15.46430 100
whileway 13.903507 19.059612 20.692397 20.577014 22.243933 27.44271 100
> (z1==z2) & (z2==z3[,.(x,t)])
x t
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
[6,] TRUE TRUE
[7,] TRUE TRUE
Upvotes: 1
Reputation: 3183
You can use dplyr::mutate and filter:
library(dplyr)

z %>%
mutate(diff = lead(x, 1) - x) %>%
filter(diff >= 0.5 | is.na(diff)) %>%
select(-diff)
I kept the diff field for ease of understanding. You can also do this in a single filter statement.
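The single-filter version could look like this (a sketch of the same lead()-based logic; coalesce() turns the trailing NA into TRUE so the last row is kept):

```r
library(data.table)
library(dplyr)

z <- data.table(x = c(10, 10.5, 11.1, 14, 14.2, 14.4, 14.6, 17, 17.4, 30),
                t = as.Date(c(27, 32:34, 36:41)))

# keep a row when the next value is at least 0.5 larger, or when there is no next row
res <- z %>% filter(coalesce(lead(x) - x >= 0.5, TRUE))
res
```

Like the mutate/filter version above, this compares each row only to its immediate neighbor rather than to the last kept row.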
Upvotes: 0