nara
nara

Reputation: 176

R: strange behavior in data.table when attempting to operate on a column using another data.table

I'm trying to operate on a data.table column using a different data.table, and assign the result to a new column in the first data.table. But I keep having this issue:

Warning messages:
1: In from:(from + len) :
  numerical expression has 10 elements: only the first used

Here is the data:

tstamps = c(1504306173, NA, NA, NA, NA, 1504393006, NA, NA, 1504459211, NA)
set.seed(0.1)
dt1 = data.table(utc_tstamp = sample(rep(tstamps, 100), 100))
dt2 = data.table(from = sample((1:90), 10), len = sample(1:10, 10))

> dt2
    from len
 1:   55   6
 2:   59   9
 3:   32  10
 4:   24   3
 5:   86   7
 6:   54   1
 7:   18   5
 8:   11   8
 9:   40   4
10:   75   2

I'm trying to count the number of NA's in dt1[from:(from+len), ] and assign the result to a new column, count, in dt2.

What I currently have for that is this

dt2[, count := dt1[from:(from+len), ][is.na(utc_tstamp), .N]]

but this is only using dt2[1,]$from and dt2[1,]$len, all the counts are just the number of NA's in dt1[dt2[1,]$from:(dt2[1,]$from + dt2[1,]$len), ], and I receive the following warning

Warning messages:
1: In from:(from + len) :
  numerical expression has 10 elements: only the first used
2: In from:(from + len) :
  numerical expression has 10 elements: only the first used

and the result is this:

> dt2
    from len count
 1:   55   6     5
 2:   59   9     5
 3:   32  10     5
 4:   24   3     5
 5:   86   7     5
 6:   54   1     5
 7:   18   5     5
 8:   11   8     5
 9:   40   4     5
10:   75   2     5

while it should be this:

> dt2
    from len count
 1:   55   6     5
 2:   59   9     5
 3:   32  10     8
 4:   24   3     3
 5:   86   7     5
 6:   54   1     2
 7:   18   5     4
 8:   11   8     5
 9:   40   4     4
10:   75   2     2

I'd appreciate it if someone explains why this is happening and how can I get what I want.

Upvotes: 4

Views: 124

Answers (2)

Frank
Frank

Reputation: 66819

Or define the joining variables explicitly and use a non-equi join:

dt2[, to := from+len]
dt1[, r := .I]

dt2[, ct := dt1[is.na(utc_tstamp)][dt2, on=.(r >= from, r <= to), .N, by=.EACHI]$N]

Upvotes: 2

akrun
akrun

Reputation: 887138

Based on the description, we get the sequence between the 'from' and 'from' added with 'len', based on this position index get the corresponding elements of 'utc_stamp' column from 'dt1', convert it to logical (is.na(), and get the sum i.e. sum of TRUE elements or the number of NA elements. Assign (:=) it to create a new column 'count' in 'df2'

dt2[, count := unlist(Map(function(x, y) 
              sum(is.na(dt1$utc_tstamp[x:y])),  from , from + len))]
dt2
#    from len count
# 1:   55   6     5
# 2:   59   9     5
# 3:   32  10     8
# 4:   24   3     3
# 5:   86   7     5
# 6:   54   1     2
# 7:   18   5     4
# 8:   11   8     5
# 9:   40   4     4
#10:   75   2     2

Or another option is to group by sequence of rows and then do the the sequence (:) based on 'from', 'len' columns to subset the column values from 'dt1' and get the sum of logical vector

dt2[, count := sum(is.na(dt1$utc_tstamp[from:(from + len)])), by = 1:nrow(dt2)]

Upvotes: 2

Related Questions