ialm
ialm

Reputation: 8727

Find and replace values with data.table in R?

After reading about benchmarks and speed comparisons of R methods, I am in the process of converting to the speedy data.table package for data manipulation on my large data sets.

I am having trouble with a particular task:

For a certain observed variable, I want to check, for each station, if the absolute lagged difference (with lag 1) is greater than a certain threshold. If it is, I want to replace it with NA, else do nothing.

I can do this for the entire data.table using the set command, but I need to do this operation by station.

Example:

# Example data. Assume the columns are ordered by date.
set.seed(1)
DT <- data.table(station=sample.int(n=3, size=1e6, replace=TRUE), 
                 wind=rgamma(n=1e6, shape=1.5, rate=1/10),
                 other=rnorm(n=1.6),
                 key="station")

# My attempt
max_rate <- 35
set(DT, i=which(c(NA, abs(diff(DT[['wind']]))) > max_rate), 
    j=which(names(DT)=='wind'), value=NA)
# The results
summary(DT)

The trouble with my implementation is that I need to do this by station, and I do not want to get the lagged difference between the last reading in station 1 and the first reading of station 2.

I tried to use the by=station operator within the [ ], but I am not sure how to do this.

Upvotes: 3

Views: 7163

Answers (1)

Arun
Arun

Reputation: 118889

One way is to get the row numbers you've to replace using the special variable .I and then assign NA to those rows by reference using the := operator (or set).

# get the row numbers
idx = DT[, .I[which(c(NA, diff(wind)) > 35)], by=station][, V1]
# then assign by reference
DT[idx, wind := NA_real_]

This FR #2793 filed by @eddi when/if implemented will have a much more natural way to accomplish this task by providing the expression resulting in the corresponding indices on LHS and the value to replace with on RHS. That is, in the future, we should be able to do:

# in the future - a more natural way of doing the same operation shown above.
DT[, wind[which(c(NA, diff(wind)) > 35)] := NA_real_, by=station]

Upvotes: 5

Related Questions