user1609452
user1609452

Reputation: 4444

Match multiple columns in dataframe from different sources

I have the results of games played between multiple players at different points in time. I have this information from two different sources who assign different unique ids to each player. I would like to find an eloquent way to match up the two data sources by player id. The two data sources

sourcex <- structure(list(outcomedate = structure(c(12637, 12637, 12637, 
12637, 12637, 12637, 12637, 12637, 12637, 12637, 12637, 12638, 
12639, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 
12641, 12641, 12641, 12643, 12644, 12644, 12644, 12644, 12644, 
12644, 12644, 12644, 12644, 12644, 12645), class = "Date"), xid1 = c(206L, 
208L, 209L, 216L, 233L, 235L, 239L, 241L, 250L, 253L, 259L, 238L, 
236L, 211L, 221L, 234L, 249L, 254L, 255L, 257L, 258L, 207L, 230L, 
248L, 258L, 207L, 211L, 230L, 234L, 236L, 248L, 249L, 254L, 255L, 
257L, 221L), xid2 = c(211L, 207L, 221L, 249L, 248L, 257L, 234L, 
255L, 236L, 258L, 254L, 230L, 241L, 253L, 235L, 238L, 208L, 233L, 
239L, 259L, 206L, 209L, 250L, 216L, 259L, 216L, 241L, 208L, 235L, 
239L, 253L, 250L, 209L, 238L, 206L, 233L), outcome1 = c(2L, 1L, 
0L, 2L, 1L, 3L, 1L, 1L, 2L, 2L, 0L, 2L, 3L, 3L, 1L, 0L, 2L, 0L, 
0L, 0L, 2L, 1L, 2L, 1L, 0L, 3L, 2L, 0L, 0L, 0L, 2L, 2L, 2L, 1L, 
1L, 1L), outcome2 = c(0L, 0L, 0L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 
0L, 1L, 0L, 1L, 0L, 0L, 1L, 2L, 0L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 
2L, 0L, 1L, 1L, 2L, 1L, 0L, 1L, 1L, 3L)), .Names = c("outcomedate", 
"xid1", "xid2", "outcome1", "outcome2"), row.names = c(NA, 36L
), class = "data.frame")

sourcey <- structure(list(outcomedate = structure(c(12637, 12637, 12637, 
12637, 12637, 12637, 12637, 12637, 12637, 12637, 12637, 12638, 
12639, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 
12641, 12641, 12641, 12643, 12644, 12644, 12644, 12644, 12644, 
12644, 12644, 12644, 12644, 12644, 12645), class = "Date"), yid1 = c(56, 
46, 67, 68, 59, 63, 55, 50, 66, 61, 57, 58, 53, 60, 64, 48, 69, 
54, 51, 65, 62, 47, 49, 52, 64, 60, 47, 48, 69, 49, 54, 51, 65, 
53, 52, 62), yid2 = c(47, 51, 64, 48, 62, 69, 53, 54, 60, 49, 
65, 52, 50, 63, 57, 56, 61, 46, 58, 67, 66, 59, 68, 55, 63, 57, 
68, 55, 59, 67, 58, 66, 50, 46, 56, 61), outcome1 = structure(c(1L, 
1L, 2L, 2L, 3L, 3L, 2L, 1L, 4L, 1L, 2L, 2L, 4L, 3L, 2L, 2L, 3L, 
3L, 3L, 4L, 1L, 1L, 1L, 2L, 3L, 1L, 4L, 2L, 2L, 2L, 1L, 3L, 2L, 
3L, 3L, 1L), .Label = c("1", "2", "0", "3", "4", "5", "6"), class = "factor"), 
    outcome2 = structure(c(1L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 3L, 
    2L, 1L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 2L, 3L, 
    2L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("0", 
    "1", "2", "3", "4"), class = "factor")), .Names = c("outcomedate", 
"yid1", "yid2", "outcome1", "outcome2"), row.names = c(NA, 36L
), class = "data.frame")

Both sources have an outcomedate, outcome1, outcome2 in common. They assign different ids to the individual players in the game. I have done the following to find the match between ids.

sourcex$ID <- with(sourcex, paste0(outcomedate, outcome1, outcome2))
sourcey$ID <- with(sourcey, paste0(outcomedate, outcome1, outcome2))

uPlayersx <- with(sourcex, unique(c(xid1, xid2)))
uPlayersy <- with(sourcey, unique(c(yid1, yid2)))

comparex <- sapply(uPlayersx, function(x){
  paste0(with(sourcex, ID[xid1 == x| xid2 == x]), collapse = '~')
})

comparey <- sapply(uPlayersy, function(x){
  paste0(with(sourcey, ID[yid1 == x| yid2 == x]), collapse = '~')
})

dumMatch <- data.frame(xid = uPlayersx, yid = uPlayersy[match(comparex, comparey)])

It works ok here on this test dataset however the real application is larger and this feels like a hack. Also the real datasets may have errors in reporting etc so partial matches might be needed. Any help would be appreciated.

Upvotes: 2

Views: 145

Answers (1)

Ferdinand.kraft
Ferdinand.kraft

Reputation: 12819

This will (at least) help filtering out days that match perfectly:

match.day <- function(d)
{
    tempx <- sourcex[sourcex$outcomedate==d,]
    tempy <- sourcey[sourcey$outcomedate==d,]

    if(nrow(tempx)!=nrow(tempy)) stop("matching failed: number of rows differ.")

    P <- outer(tempx$outcome1, tempy$outcome1, `==`) &
         outer(tempx$outcome2, tempy$outcome2, `==`)

    if(any(rowSums(P)!=1)) stop("maching failed: ambiguous or impossible assignment.")

    map <- P %*% seq_len(nrow(tempy))

    cbind(tempx[,c("xid1","xid2")], tempy[map,c("yid1","yid2")])
}

days <- unique(c(sourcex$outcomedate, sourcey$outcomedate))

do.call(rbind, lapply(days[-c(1,4,7)], match.day))

Note that it failed for days 1, 4 and 7 (see days[c(1,4,7)]).

Result for other days:

   xid1 xid2 yid1 yid2
12  238  230   58   52
13  236  241   53   50
22  207  209   47   59
23  230  250   52   55
24  248  216   49   68
25  258  259   64   63
36  221  233   62   61

Upvotes: 1

Related Questions