Nick Knauer
Nick Knauer

Reputation: 4243

Seeing if all values in one dataframe row exist in another dataframe

I have a dataframe as follows:

df1    

ColA     ColB     ColC     ColD
  10        A        B        L
  11        N        Q       NA
  12        P        J        L
  43        M        T       NA
  89        O        J        T

df2

ATTR      Att      R1   R2    R3    R4
   1       45       A    B    NA    NA
   2       40       C    D    NA    NA
   3       33       T    J     O    NA
   4       65       L   NA    NA    NA
   5       20       P    L     J    NA
   6       23       Q   NA    NA    NA
   7       38       Q    L    NA    NA

How do I match up df2 with df1 so that if ALL the values in each df2 row (disregarding the order) show up in the df1 rows, then it will populate. So it is checking if ALL not just one value from each df2 row matches up with each df1 row. The final result in this case should be this:

ColA     ColB     ColC     ColD   ATTR      Att      R1   R2    R3    R4
  10        A        B        L      1       45       A    B    NA    NA
  10        A        B        L      4       65       L   NA    NA    NA
  11        N        Q       NA      6       23       Q   NA    NA    NA
  12        P        J        L      4       65       L   NA    NA    NA
  12        P        J        L      5       20       P    L     J    NA    
  89        O        J        T      3       33       T    J     O    NA

Thanks

Upvotes: 0

Views: 607

Answers (1)

Sotos
Sotos

Reputation: 51582

Here is a possible solution using base R.

Make sure everything is a character before continuing, i.e.

df[-1] <- lapply(df[-1], as.character)
df1[-c(1:2)] <- lapply(df1[-c(1:2)], as.character)

First we create two lists which contain vectors of the rowwise elements of each data frame. We then create a matrix with the length of elements from l2 are found in l1, If the length is 0 then it means they match. i.e,

l1 <- lapply(split(df[-1], seq(nrow(df))), function(i) i[!is.na(i)])
l2 <- lapply(split(df1[-c(1:2)], seq(nrow(df1))), function(i) i[!is.na(i)])

m1 <- sapply(l1, function(i) sapply(l2, function(j) length(setdiff(j, i))))
m1
#  1 2 3 4 5
#1 0 2 2 2 2
#2 2 2 2 2 2
#3 3 3 2 2 0
#4 0 1 0 1 1
#5 2 3 0 3 2
#6 1 0 1 1 1
#7 1 1 1 2 2

We then use that matrix to create a couple of coloumns in our original df. The first column rpt will indicate how many times each row has length 0 and use that as a number of repeats for each row. We also use it to filter out all the 0 lengths (i.e. the rows that do not have a match with df1). After expanding the data frame we create another variable; ATTR (same name as ATTR in df1) in order to use it for a merge. i.e.

df$rpt <- colSums(m1 == 0)
df <- df[df$rpt != 0,]
df <- df[rep(row.names(df), df$rpt),]
df$ATTR <- which(m1 == 0, arr.ind = TRUE)[,1]
df
#    ColA ColB ColC ColD rpt ATTR
#1     10    A    B    L   2    1
#1.1   10    A    B    L   2    4
#2     11    N    Q <NA>   1    6
#3     12    P    J    L   2    4
#3.1   12    P    J    L   2    5
#5     89    O    J    T   1    3

We then merge and order the two data frames,

final_df <- merge(df, df1, by = 'ATTR')

final_df[order(final_df$ColA),]
#  ATTR ColA ColB ColC ColD rpt Att R1   R2   R3   R4
#1    1   10    A    B    L   2  45  A    B <NA> <NA>
#3    4   10    A    B    L   2  65  L <NA> <NA> <NA>
#6    6   11    N    Q <NA>   1  23  Q <NA> <NA> <NA>
#4    4   12    P    J    L   2  65  L <NA> <NA> <NA>
#5    5   12    P    J    L   2  20  P    L    J <NA>
#2    3   89    O    J    T   1  33  T    J    O <NA>

DATA

dput(df)
structure(list(ColA = c(10L, 11L, 12L, 43L, 89L), ColB = c("A", 
"N", "P", "M", "O"), ColC = c("B", "Q", "J", "T", "J"), ColD = c("L", 
NA, "L", NA, "T")), .Names = c("ColA", "ColB", "ColC", "ColD"
), row.names = c(NA, -5L), class = "data.frame")

dput(df1)
structure(list(ATTR = 1:7, Att = c(45L, 40L, 33L, 65L, 20L, 23L, 
38L), R1 = c("A", "C", "T", "L", "P", "Q", "Q"), R2 = c("B", 
"D", "J", NA, "L", NA, "L"), R3 = c(NA, NA, "O", NA, "J", NA, 
NA), R4 = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_)), .Names = c("ATTR", 
"Att", "R1", "R2", "R3", "R4"), row.names = c(NA, -7L), class = "data.frame")

Upvotes: 1

Related Questions