Laura K
Laura K

Reputation: 207

Matching dataframes with data.table

I need to fill a matrix (MA) with information from a long data frame (DF) using another matrix as identifier (ID.MA).

An idea of my three matrices: MA.ID creates an identifier to look in the big DF the needed variables:

     a      b      c
a    ID.aa  ID.ab  ID.ac
b    ID.ba  ID.bb  ID.bc
c    ID.ca  ID.cb  ID.cc

The original big data frame has useless information but has also the rows that are useful for me to fill the target MA matrix:

ID     1990 1991 1992
ID.aa  10   11   12
ID.ab  13   14   15
ID.ac  16   17   18
ID.ba  19   20   21
ID.bb  22   23   24
ID.bc  25   26   27
ID.ca  28   29   30
ID.cb  31   32   33
ID.cc  34   35   36
ID.xx  40   40   55
ID.xy  50   51   45
....

MA should be filled with cross-information. In my example it should look like that for a chosen column of DF (let's say, 1990):

     a    b    c
a    10   13   16
b    19   22   25
c    28   31   34

I've tried to use match but honestly it didn't work out:

MA$a = DF[match(MA.ID$a, DF$ID),2]

I was recommended to use the data.table package but I couldn't see how that would help me.

Anyone has any good way to approach this problem?

Upvotes: 3

Views: 322

Answers (2)

Jaap
Jaap

Reputation: 83215

Supposing that your input are dataframes, then you could do the following:

library(data.table)
setDT(ma)[, lapply(.SD, function(x) x = unlist(df[match(x,df$ID), "1990"]))
          , .SDcols = colnames(ma)]

which returns:

    a  b  c
1: 10 13 16
2: 19 22 25
3: 28 31 34

Explanation:

  • With setDT(ma) you transform the dataframe into a datatable (which is an enhanced dataframe).
  • With .SDcols=colnames(ma) you specify on which columns the transformation has to be applied.
  • lapply(.SD, function(x) x = unlist(df[match(x,df$ID),"1990"])) performs the matching operation on each column specified with .SDcols.

An alternative approach with data.table is first transforming ma to a long data.table:

ma2 <- melt(setDT(ma), measure.vars = c("a","b","c"))
setkey(ma2, value)    # set key by which 'ma' has to be indexed
setDT(df, key="ID")   # transform to a datatable & set key by which 'df' has to be indexed

# joining the values of the 1990 column of df into
# the right place in the value column of 'ma'
ma2[df, value :=  `1990`]

which gives:

> ma2
   variable value
1:        a    10
2:        b    13
3:        c    16
4:        a    19
5:        b    22
6:        c    25
7:        a    28
8:        b    31
9:        c    34

The only drawback of this method is that the numeric values in the 'value' column get stored as character values. You can correct this by extending it as follows:

ma2[df, value :=  `1990`][, value := as.numeric(value)]

If you want to change it back to wide format you could use the rowid function within dcast:

ma3 <- dcast(ma2, rowid(variable) ~ variable, value.var = "value")[, variable := NULL]

which gives:

> ma3
    a  b  c
1: 10 13 16
2: 19 22 25
3: 28 31 34

Used data:

ma <- structure(list(a = structure(1:3, .Label = c("ID.aa", "ID.ba", "ID.ca"), class = "factor"), 
                     b = structure(1:3, .Label = c("ID.ab", "ID.bb", "ID.cb"), class = "factor"), 
                     c = structure(1:3, .Label = c("ID.ac", "ID.bc", "ID.cc"), class = "factor")), 
                .Names = c("a", "b", "c"), class = "data.frame", row.names = c(NA, -3L))

df <- structure(list(ID = structure(1:9, .Label = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc", "ID.ca", "ID.cb", "ID.cc"), class = "factor"), 
                     `1990` = c(10L, 13L, 16L, 19L, 22L, 25L, 28L, 31L, 34L), 
                     `1991` = c(11L, 14L, 17L, 20L, 23L, 26L, 29L, 32L, 35L), 
                     `1992` = c(12L, 15L, 18L, 21L, 24L, 27L, 30L, 33L, 36L)), 
                .Names = c("ID", "1990", "1991", "1992"), class = "data.frame", row.names = c(NA, -9L))

Upvotes: 2

Jealie
Jealie

Reputation: 6267

In base R, it can be seen as a job for outer:

> outer(1:nrow(MA.ID), 1:ncol(MA.ID), Vectorize(function(x,y) {DF[which(DF$ID==MA.ID[x,y]),'1990']}))
     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   19   22   25
[3,]   28   31   34

Explanations:

  • outer creates a matrix as the outer product of the first argument X (here a b c) and the second argument Y (here the same, a b c)
  • for every combination of X and Y, it applies a function that looks up the value in DF, at the row where the ID is MA.ID[X,Y], and at the column 1990
  • the important trick here is to wrap the function with Vectorize, because outer expects a vectorized function
  • the result is finally returned as matrix

Alternatively, another way to do it (still in base R) is:

  1. to convert your data frame MA.ID into a vector
  2. sapply a quick function that looks up correspondance with DF$ID
  3. and convert back to a matrix.

This works:

> structure(
     sapply(unlist(MA.ID), 
            function(id){DF[which(DF$ID==id),'1990']}),
     dim=dim(MA.ID), names=NULL)

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   19   22   25
[3,]   28   31   34

(here the call to structure(..., dim=dim(MA.ID), names=NULL) converts back the vector to a matrix)

Upvotes: 2

Related Questions