tchakravarty
tchakravarty

Reputation: 10954

R: data.table cross-join not working

I have two data.tables that I want to join (form a Cartesian product of). One of the data.tables is keyed on a Date vector, and the other on a numeric vector:

# data.table with dates (as numeric)
dtDates2 = data.table(date = 
                       as.numeric(seq(from = as.Date('2014/01/01'), 
                           to = as.Date('2014/07/01'), by = 'weeks')),
                     data1 = rnorm(26))

# data.table with dates
dtDates1 = data.table(date = 
                        seq(from = as.Date('2014/01/01'), 
                            to = as.Date('2014/07/01'), by = 'weeks'),
                      data1 = rnorm(26))


# data.table with customer IDs
dtCustomers = data.table(customerID = seq(1, 100),
                      data2 = rnorm(100))

I setkey and try to cross-join them using CJ:

# cross join the two datatables
setkey(dtCustomers, customerID)
setkey(dtDates1, date)
setkey(dtDates2, date)

CJ(dtCustomers, dtDates1)
CJ(dtCustomers, dtDates2)

but get the following error:

Error in FUN(X[[1L]], ...) : 
  Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.

Not sure what I am doing wrong.

Upvotes: 7

Views: 4398

Answers (2)

jangorecki
jangorecki

Reputation: 16697

There is no cross join functionality available in data.table out of the box.
Yet there is CJ.dt function (a CJ like but designed for data.tables) to achieve cartesian product (cross join) available in optiRum package (available in CRAN).
You can create the function:

CJ.dt = function(X,Y) {
  stopifnot(is.data.table(X),is.data.table(Y))
  k = NULL
  X = X[, c(k=1, .SD)]
  setkey(X, k)
  Y = Y[, c(k=1, .SD)]
  setkey(Y, NULL)
  X[Y, allow.cartesian=TRUE][, k := NULL][]
}
CJ.dt(dtCustomers, dtDates1)
CJ.dt(dtCustomers, dtDates2)

Yet there is a FR for convenience way to perform cross join filled in data.table#1717, so you could check there if there is a nicer api for cross join.

Upvotes: 23

Fabian Gehring
Fabian Gehring

Reputation: 1173

thank you jangorecki for the very useful function

I had to add support for empty X and/or Y:

CJ.dt = function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))

  if(nrow(X) > 0 & nrow(Y) > 0){
    k = NULL
    X = X[, c(k = 1, .SD)]
    setkey(X, k)
    Y = Y[, c(k = 1, .SD)]
    setkey(Y, NULL)
    return(X[Y, allow.cartesian = T][, k := NULL][])

  } else {
    duplicatedNames <- names(Y)[names(Y) %in% names(X)]
    if(length(duplicatedNames) > 0) {
       setnames(Y, duplicatedNames, paste0("i.", duplicatedNames))
    }
    setkey(Y)
    setkey(X)
    return(cbind(X[!X], Y[!Y]))
  }

}

# X <- data.table(a = c(1, 2))
# Y <- data.table(a = c(2, 3), b = c(4, 5))
#
# CJ.dt(X, Y)

# CJ.dt(X[a > 2], Y)

Upvotes: 3

Related Questions