Mike Flynn
Mike Flynn

Reputation: 1027

Joining data.table seems to be replacing values with NA's

I am trying to join two data.tables the same way I would join data frames with merge() with all=TRUE, like so:

> x = data.frame(index = 1:10)
> y = data.frame(index = c(2,4,6), weight = c(0.2, 0.3, 0.5))
> x
   index
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
> y
  index weight
1     2    0.2
2     4    0.3
3     6    0.5
> merge(x,y, all=TRUE)
   index weight
1      1     NA
2      2    0.2
3      3     NA
4      4    0.3
5      5     NA
6      6    0.5
7      7     NA
8      8     NA
9      9     NA
10    10     NA

I have two data tables, matched.weights and casted, which was the result of a cast function. Their keys are both "index".

> matched.weights
      index
 [1,]     1
 [2,]     2
 [3,]     3
 [4,]     4
 [5,]     5
 [6,]     6
 [7,]     7
 [8,]     8
 [9,]     9
[10,]    10
First 10 rows of 14708 printed. 
> casted
      index            1            2
 [1,] 10002 0.0049024578 3.057794e-03
 [2,] 10129 0.0011064567 8.481497e-04
 [3,] 10171 0.0012964050 5.540875e-03
 [4,] 10177 0.0008181397 6.981567e-04
 [5,] 10187 0.0060350971 2.156695e-03
 [6,]  1022 0.0011320117 8.665125e-04
 [7,] 10250 0.0047892924 7.725221e-03
 [8,] 10262 0.0044724898 8.787161e-06
 [9,] 10286 0.0016650838 7.162887e-05
[10,] 10299 0.0015840791 5.578903e-03
First 10 rows of 305 printed. 

I get this error when I try to merge them:

> merge(matched.weights, casted, all=TRUE)
Error in setcolorder(dt, c(setdiff(names(dt), end), end)) : 
  neworder is length 5 but x has 3 columns.

And when I use the data.table syntax it replaces the data entries with NA's:

> merged = casted[matched.weights]
> merged
      index X1 X2
 [1,]     1 NA NA
 [2,]     2 NA NA
 [3,]     3 NA NA
 [4,]     4 NA NA
 [5,]     5 NA NA
 [6,]     6 NA NA
 [7,]     7 NA NA
 [8,]     8 NA NA
 [9,]     9 NA NA
[10,]    10 NA NA
First 10 rows of 14708 printed. 
> merged[10002]
     index X1 X2
[1,] 10002 NA NA
> sum(merged[[2]], na.rm=T)
[1] 0

Can anyone explain to me what is going on?

Upvotes: 1

Views: 414

Answers (1)

Andrie
Andrie

Reputation: 179428

If you want to merge two data tables (in the sense of a full outer join, similar to all=TRUE) you need to use merge.data.table:

library(data.table)
dtx <- data.table(x, key="index")
dty <- data.table(y, key="index")
merge(dtx,dty, all=TRUE)
   index weight
1      1     NA
2      2    0.2
3      3     NA
4      4    0.3
5      5     NA
6      6    0.5
7      7     NA
8      8     NA
9      9     NA
10    10     NA

This is the same result you get as when using a data frame (but it is much faster).

It's worth noting that some of the merge functionality was improved in recent versions of data.table, so be sure to get the latest stable version. I am using version 1.8.0.

Upvotes: 1

Related Questions