user2822260
user2822260

Reputation:

Merge based on partial match of ID in R

I have two datasets I would like to merge based on a partial match of an ID that is a string of ID's separated by commas. So my first dataset has a ID field that is basic (one string) but the second has an ID string that includes multiple different IDs separated by a comma. For example:

TIME   GROUPID  y1  y2
 1      A       0    1
 1      B       1    1
 2      C       1    0
 2      D       0    0
 3      E       1    0 

TIME  GROUPID   x1
  1     A,B     4 
  2     B,C     2    
  3      E      3    

I would like to merge these two datasets based on TIME and GROUPID but for GROUPID I would like a partial match. So "A" would match "A" but it would also match "A,E" or "B,A". The result:

  TIME   GROUPID  y1  y2  x1
 1      A       0    1    4
 1      B       1    1    4
 2      C       1    0    2
 2      D       0    0    NA
 3      E       1    0    3

Thanks so much in advance!

Upvotes: 1

Views: 421

Answers (1)

Pierre L
Pierre L

Reputation: 28441

Try cSplit from splitstackshape. With the argument direction="long" we give the lookup data.frame a standard merging layout:

library(splitstackshape)
lkup <- cSplit(df2, "GROUPID", direction="long")
merge(df1, lkup, by=c("TIME", "GROUPID"), all.x=T)
#   TIME GROUPID y1 y2 x1
# 1    1       A  0  1  4
# 2    1       B  1  1  4
# 3    2       C  1  0  2
# 4    2       D  0  0 NA
# 5    3       E  1  0  3

data

df1 <- structure(list(TIME = c(1L, 1L, 2L, 2L, 3L), GROUPID = structure(1:5, .Label = c("A", 
"B", "C", "D", "E"), class = "factor"), y1 = c(0L, 1L, 1L, 0L, 
1L), y2 = c(1L, 1L, 0L, 0L, 0L)), .Names = c("TIME", "GROUPID", 
"y1", "y2"), class = "data.frame", row.names = c(NA, -5L))

df2 <- structure(list(TIME = 1:3, GROUPID = structure(1:3, .Label = c("A,B", 
"B,C", "E"), class = "factor"), x1 = c(4L, 2L, 3L)), .Names = c("TIME", 
"GROUPID", "x1"), class = "data.frame", row.names = c(NA, -3L
))

Upvotes: 4

Related Questions