Saurabh
Saurabh

Reputation: 1626

Subset data.table containing groups with at least 2 non-NA values

In the data.table below, I want to subset only the groups (in column x), which have at-least 2 non-NA values.

dt <- data.table(x = c(rep("a", 5), rep("b", 3), rep("c", 3), rep("d", 3), rep("e", 4)), y = c(NA, 1, 2, 3, NA, NA, 2, 3, 1, 2, 3, NA, NA, 1, 1, 2, NA, NA))
    x  y
 1: a NA
 2: a  1
 3: a  2
 4: a  3
 5: a NA
 6: b NA
 7: b  2
 8: b  3
 9: c  1
10: c  2
11: c  3
12: d NA
13: d NA
14: d  1
15: e  1
16: e  2
17: e NA
18: e NA

The correct solution is too slow on the large data sets.

dt[, .SD[sum(!is.na(y)) >= 2], by = "x"]

The alternate solution is fast, but I don't know how to put the condition to check for non-NA values.

dt[, if(.N >= 2L) .SD, by = "x"]

Here is the performance

microbenchmark::microbenchmark(
      a = dt[, .SD[sum(!is.na(y)) >= 2], by = "x"],
      b = dt[, if(.N >= 2L) .SD, by = "x"],
      times = 100)
Unit: microseconds
 expr      min       lq      mean    median       uq      max neval cld
    a 1953.548 1977.380 2026.2570 2021.6775 2055.699 2276.247   100   b
    b  862.626  889.073  936.8346  906.7315  918.019 3744.886   100  a

Upvotes: 2

Views: 60

Answers (2)

Saurabh
Saurabh

Reputation: 1626

The following turned out to be even faster -

dt[, if(length(na.omit(y)) >= 2) .SD, by = "x"]

Here are the performance results

    microbenchmark::microbenchmark(
      a = dt[, .SD[sum(!is.na(y)) >= 2], by = "x"],
      c = dt[dt[, .I[sum(!is.na(y)) >= 2], by = "x"]$V1],
      d = dt[, if(length(na.omit(y)) >= 2) .SD, by = "x"],
      times = 1000)

    Unit: microseconds
    expr      min        lq      mean    median        uq      max neval cld
    a 1927.523 1967.3550 2063.7377 1991.9865 2093.7970 24554.46  1000   c
    c 1131.147 1163.5625 1291.7830 1196.6600 1245.2065 23912.77  1000  b
    d  888.337  916.1545  972.7813  927.0180  971.1625 25220.55  1000 a

Upvotes: 0

akrun
akrun

Reputation: 887118

A faster option would be to get the index with .I and extract that row index

dt[dt[, .I[sum(!is.na(y)) >= 2], by = "x"]$V1]

-benchmarks

microbenchmark::microbenchmark(
      a = dt[, .SD[sum(!is.na(y)) >= 2], by = "x"],
      b = dt[dt[, .I[sum(!is.na(y)) >= 2], by = "x"]$V1],

      times = 100)
Unit: microseconds
 expr     min       lq      mean    median        uq      max neval
    a 898.833 943.8020 1095.5831 1037.7235 1105.3080 4366.111   100
    b 516.667 579.7975  622.3739  613.9905  661.9465 1010.461   100

Upvotes: 3

Related Questions