user12310746

Reputation: 279

"recycling" error in user-defined function for data.table

I've joined two data tables and am calculating means based on a subset of that data. The code below runs properly when it's not within a function that I wrote, but I'm getting this error when I try to use the function:

Error in `[.data.table`(poll.name, AQ.Date >= Cdate & AQ.Date < Cdate +  : 
  i evaluates to a logical vector length 159 but there are 2797432 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

My function:

 myfunc <- function(linked.dat, poll.name) {

  linked.dat[,
        `:=` (t1.avg = mean(poll.name[AQ.Date >= Cdate & AQ.Date < Cdate + 1], na.rm = TRUE),
              t2.avg = mean(poll.name[AQ.Date >= Cdate + 1 & AQ.Date < Cdate + 2], na.rm = TRUE),
              t3.avg = mean(poll.name[AQ.Date >= Cdate + 2 & AQ.Date <= Bdate], na.rm = TRUE),
              total.avg = mean(poll.name)),
        by = ID]

  linked.pollname <- linked.dat

  return(linked.pollname)

}

So using this function with the example df would look like:

myfunc(df, O3) 

Some data:

df <- structure(list(O3 = c(21.1, 27.3, 23.8, 29.5, 23.8, 27.1, 31.6, 
25.8, 31.2, 14, 19.1, 15.5, 15.6, 28.6, 16.9, 27.4, 30.1, 24.4, 
21.2, 22.1, 26.1, 19.9), AQ.Date = structure(c(3679, 3681, 3682, 
3683, 3680, 3685, 3686, 3687, 3684, 3689, 3673, 3675, 3677, 3678, 
3686, 3687, 3688, 3692, 3681, 3693, 3695, 3696), class = "Date"), 
    ID = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "a", 
    "a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"
    ), Cdate = structure(c(3673, 3673, 3673, 3673, 3673, 
    3673, 3673, 3673, 3673, 3673, 3673, 3673, 3677, 3677, 3677, 
    3677, 3677, 3677, 3677, 3677, 3677, 3677), class = "Date"), 
    Bdate = structure(c(3690, 3690, 3690, 3690, 3690, 3690, 
    3690, 3690, 3690, 3690, 3690, 3690, 3696, 3696, 3696, 3696, 
    3696, 3696, 3696, 3696, 3696, 3696), class = "Date"), Total_weeks = c(2.428571, 
    2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 
    2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 2.714286, 
    2.714286, 2.714286, 2.714286, 2.714286, 2.714286, 2.714286, 
    2.714286, 2.714286, 2.714286)), row.names = c(NA, -22L), class = "data.frame")

setDT(df) 

I'm not understanding what this error means. What is the recycling referring to? Why is it only happening within the function? How can I adjust the function to address the error?

Upvotes: 0

Views: 993

Answers (1)

r2evans

Reputation: 160587

Recycling in general

Recycling is what R does when vectors of different lengths are combined, for example into a data.frame (and in some other places): the shorter vector is repeated until it matches the longer one. Every column of a data.frame (and therefore of a data.table and tbl_df) must be the same length, so anything shorter is recycled.

In most (all?) base R functions, recycling is done silently as long as the length of the longest vector is a whole multiple of the length of each shorter vector. For instance,

data.frame(x = 1, y = 1:3)
#   x y
# 1 1 1
# 2 1 2
# 3 1 3
data.frame(x = 1:2, y = 1:4)
#   x y
# 1 1 1
# 2 2 2
# 3 1 3
# 4 2 4

but R will error (usually, but not in all cases) when the longer length is not a whole multiple of the shorter one:

data.frame(x = 1:3, y = 1:4)
# Error in data.frame(x = 1:3, y = 1:4) : 
#   arguments imply differing number of rows: 3, 4
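
The "usually, but not in all cases" caveat shows up in plain vector arithmetic, where base R still recycles a mismatched length and only warns. A minimal sketch (the variable names are illustrative, not from the question):

```r
# Lengths 6 and 2: the shorter vector is recycled silently.
ok <- 1:6 + 1:2
ok
# [1] 2 4 4 6 6 8

# Lengths 3 and 2: still recycled, but with a warning instead of an error.
partial <- 1:3 + 1:2
# Warning message:
# longer object length is not a multiple of shorter object length
partial
# [1] 2 4 4
```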

My personal opinion is that recycling is a balance between convenience and safety. "Convenience" is being able to add a column with a single invariant value to a frame with multiple rows, as in the first example above; "safety" is being certain what each function returns (e.g., its length), so that surprises are not hidden.

For the latter, consider a custom function (meant to mimic which.min) that finds the location of the minimum value:

myfunc <- function(x) which(x == min(x)) # this is naive, do not use it

With "normal" data, it will return a single value, as in

set.seed(42)
myfunc(runif(10))
# [1] 8

However, perhaps when dealing with integers or something else where equality can happen (and in some rare numeric instances), one might get more than one:

myfunc(sample(10, size = 11, replace = TRUE))
# [1]  2 10

Because of this, if you rely on it returning a single value but it instead returns two or more, then ... something you rely on might do silent recycling and you are none the wiser. For instance,

set.seed(3)
mydat <- data.frame(x = sample(10, size = 12, replace = TRUE))
mydat$y <- myfunc(mydat$x)
mydat
#     x y
# 1   5 4
# 2  10 8
# 3   7 4
# 4   4 8
# 5  10 4
# 6   8 8
# 7   8 4
# 8   4 8
# 9  10 4
# 10  7 8
# 11  8 4
# 12  8 8

From my perspective, recycling is only "acceptable" when it's an all-or-1 thing ... anything else can be used correctly in many places but in my opinion should really be explicit.
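
As a rough illustration of "explicit", one option is to spell out any repetition with rep() and to check a helper's length before assigning it as a column. This is only a sketch; the frame and column names below are made up for the example:

```r
mydat <- data.frame(x = 1:6)

# All-or-1: a single value is unambiguous.
mydat$source <- "sensor_a"

# Anything else: write the recycling out so the intent is visible.
mydat$shift <- rep(c("day", "night"), length.out = nrow(mydat))

# Guard a helper's result before it can be recycled silently.
idx <- which.min(mydat$x)
stopifnot(length(idx) == 1L)
mydat$min_at <- idx
```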

tibble allows all-or-1, otherwise it errors:

library(tibble)
tibble(x = 1, y = 1:3)
# # A tibble: 3 x 2
#       x     y
#   <dbl> <int>
# 1     1     1
# 2     1     2
# 3     1     3
tibble(x = 1:2, y = 1:3)
# Error: Tibble columns must have compatible sizes.
# * Size 2: Existing data.
# * Size 3: Column `y`.
# i Only values of size one are recycled.

Specific to Your Problem

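You are trying to do non-standard evaluation of the symbol O3 outside of the data.table construct. Inside the function, poll.name is not a column of linked.dat, so it is not resolved per group; it evaluates to whatever object was passed in from the calling environment. Judging by the error message, that object is a data.table with 2797432 rows, while the logical index built from AQ.Date and Cdate has only the length of the current group (159), hence the complaint about recycling. I believe you are intending to take the mean of a user-provided column of the frame based on other conditions.

Here's one way to do it: pass a string, and use get(poll.name) (wherever you need the data) within the data.table to get at the data: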
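
The error message itself can be reproduced in isolation: data.table refuses to recycle a logical i whose length is neither 1 nor the number of rows, whereas a plain vector recycles it silently. A minimal sketch with a made-up table named big:

```r
library(data.table)

big <- data.table(v = 1:10)

# Base R: a length-2 logical index is silently recycled over 10 elements.
base_res <- (1:10)[c(TRUE, FALSE)]
base_res
# [1] 1 3 5 7 9

# data.table: the same index on a 10-row table is an error.
# The message mentions "Recycling of logical i is no longer allowed".
dt_msg <- tryCatch(
  big[c(TRUE, FALSE)],
  error = function(e) conditionMessage(e)
)
dt_msg
```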

myfunc <- function(linked.dat, poll.name) {
  linked.dat[,
             `:=` (t1.avg = mean(get(poll.name)[AQ.Date >= Cdate & AQ.Date < Cdate + 1], na.rm = TRUE),
                   t2.avg = mean(get(poll.name)[AQ.Date >= Cdate + 1 & AQ.Date < Cdate + 2], na.rm = TRUE),
                   t3.avg = mean(get(poll.name)[AQ.Date >= Cdate + 2 & AQ.Date <= Bdate], na.rm = TRUE),
                   total.avg = mean(get(poll.name))),
             by = ID]

  linked.pollname <- linked.dat

  return(linked.pollname)
}

myfunc(df, "O3") 
#       O3    AQ.Date ID      Cdate      Bdate Total_weeks t1.avg t2.avg   t3.avg total.avg
#  1: 21.1 1980-01-28  a 1980-01-22 1980-02-08    2.428571   19.1    NaN 24.60909     24.15
#  2: 27.3 1980-01-30  a 1980-01-22 1980-02-08    2.428571   19.1    NaN 24.60909     24.15
#  3: 23.8 1980-01-31  a 1980-01-22 1980-02-08    2.428571   19.1    NaN 24.60909     24.15
#  4: 29.5 1980-02-01  a 1980-01-22 1980-02-08    2.428571   19.1    NaN 24.60909     24.15
#  5: 23.8 1980-01-29  a 1980-01-22 1980-02-08    2.428571   19.1    NaN 24.60909     24.15
# ---                                                                                      
# 18: 24.4 1980-02-10  b 1980-01-26 1980-02-14    2.714286   15.6   28.6 23.51250     23.23
# 19: 21.2 1980-01-30  b 1980-01-26 1980-02-14    2.714286   15.6   28.6 23.51250     23.23
# 20: 22.1 1980-02-11  b 1980-01-26 1980-02-14    2.714286   15.6   28.6 23.51250     23.23
# 21: 26.1 1980-02-13  b 1980-01-26 1980-02-14    2.714286   15.6   28.6 23.51250     23.23
# 22: 19.9 1980-02-14  b 1980-01-26 1980-02-14    2.714286   15.6   28.6 23.51250     23.23
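
To check that the string-based version matches a hard-coded column, something like the following can be run on toy data. It is only a sketch: the table toy and the helper day1_mean are hypothetical, and copy() is used because `:=` modifies a data.table by reference (so the function above also changes df itself).

```r
library(data.table)

# Toy data, not the asker's: two IDs, three daily readings each.
toy <- data.table(
  ID      = rep(c("a", "b"), each = 3),
  AQ.Date = as.Date("2020-01-01") + c(0, 1, 2, 0, 1, 2),
  Cdate   = as.Date("2020-01-01"),
  O3      = c(10, 20, 30, 40, 50, 60)
)

# Hypothetical helper: one windowed mean, column chosen by name.
day1_mean <- function(dt, poll.name) {
  out <- copy(dt)  # work on a copy so the caller's table is untouched
  out[, t1.avg := mean(get(poll.name)[AQ.Date >= Cdate & AQ.Date < Cdate + 1],
                       na.rm = TRUE),
      by = ID]
  out[]
}

res <- day1_mean(toy, "O3")

# Same calculation with the column hard-coded, for comparison.
ref <- copy(toy)
ref[, t1.avg := mean(O3[AQ.Date >= Cdate & AQ.Date < Cdate + 1], na.rm = TRUE),
    by = ID]

stopifnot(isTRUE(all.equal(res$t1.avg, ref$t1.avg)))
```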

Upvotes: 1
