Gore
Gore

Reputation: 1

Cross-check for duplicate values across dataframes

I'm trying to write a function that takes as its arguments a list of dataframes and a list of criteria, then returns a list of those dataframes with columns that indicate the row on which those values are duplicated in another dataframe.

For example, I have three dataframes:

df1:

Name1 | Zip_code | Data
----- | -------- | ----
George|  123     |  abc
----- | -------- | ----
Marge |  456     |  def
----- | -------- | ----
Mike  |  789     |  foo

df2:

Name  |  data    | zip_code
----- | -------- | --------
Mike  | klm      | 789
----- | -------- | --------
George| xxx      | 123
----- | -------- | --------
Marge | yyy      | 456
----- | -------- | --------
Bob   | zzz      | 678

df3:

Data  |  Name    | zip_code
----- | -------- | --------
zzz   |  Bob     | 678
----- | -------- | --------
ggg   | Mike     | 789

Assuming I only care about which names and zip-codes are duplicated, I want the output to look like this:

df1:

Name1 |  Zip_code | Data | row_df2 | row_df3
----- | --------- | ---- | ------- | -------
George| 123       | abc  | 2       | NA
----- | --------- | ---- | ------- | -------
Marge | 456       | def  | 3       | NA
----- | --------- | ---- | ------- | -------
Mike  | 789       | foo  | 1       | 2

df2:

Name  | data  | zip_code  | row_df3
----- | ----- | --------- | -------
Mike  | klm   | 789       | 2
----- | ----- | --------- | -------
George| xxx   | 123       | NA
----- | ----- | --------- | -------
Marge | yyy   | 456       | NA
----- | ----- | --------- | -------
Bob   | zzz   | 678       | 1

The column names will not always be the same between each dataframe, e.g. we can have "Name" in one dataframe and "NameWhole" in another. Additionally, there can be differing amounts of columns in each dataframe. I realized that the order of the data that is to be compared needs to be the same from left to right per dataframe, but otherwise what is between columns does not matter. Thus,

df1 has:

Name | zip_code | data

df2 has:

data | Name | zip_code

df3 has:

Name | data | zip_code

My current solution is as follows:

First, initialize the dataframe list which is the first argument of the function:

dflist[[1]] <- df1
dflist[[2]] <- df2
dflist[[3]] <- df3

Then we initialize the the criteria list which is the second argument of the function. Since we're interested in what Names and zip_codes are common across the dataframes, this is:

criterialist[[1]] <- c(1,2)
criterialist[[2]] <- c(1,3)
criterialist[[3]] <- c(2,3)

Now the function is:

cross_checker <- function(dflist, criterialist){

# Insert an index column indicating the row number to be returned:
for (i in 2:length(dflist)){
dflist[[i]]$index <- 1:nrow(dflist[[i]])
}

# Next we loop over the dataframes with two for-loops:
for (i in 1:length(dflist)-1){
  for (j in 2:length(dflist)){
  dflist[[i]][,ncol(dflist[[i]])+1] <- merge(dflist[[i]], dflist[[j]], by.x=criterialist[[i]], by.y=criterialist[[j]], all.x=TRUE)$index
}
}

As a result I've only got one new index column into df1, and sometimes my RStudio just opens up a debug window. I'm not sure if "merge" is the way to go with this, but I haven't managed to figure out how "match" would work either.

I suppose one way would be to brute-force it with for-loops, but I think that would be very slow.

The end idea is to create a function that takes an arbitrary amount of dataframes with an arbitrary criteria by which to check for duplicated records and return those dataframes with a new column that indicates on which row and in which dataframe the record was duplicated.

Edit: Apologies, my first question. Here is reproducible code for the tables:

name1 <- c("George","Marge","Mike")
zip1 <- c(123,456,789)
data1 <- c("abc","def","foo")
df1 <- data.frame(name1,zip1,data1,stringsAsFactors = F)

name2 <- c("Mike","George","Marge","Bob")
data2 <- c("klm","xxx","yyy","zzz")
zip2 <- c(789,123,456,678)
df2 <- data.frame(name2,data2,zip2,stringsAsFactors = F)

data3 <- c("zzz", "ggg")
name3 <- c("Bob","Mike")
zip3 <- c(678,789)
df3 <- data.frame(data3,name3,zip3,stringsAsFactors = F)

Edit 2:

I decided to add one additional dataframe (so now there are 4):

    name1 <- c("George","Marge","Mike")
    zip1 <- c(123,456,789)
    data1 <- c("abc","def","foo")
    df1 <- data.frame(name1,zip1,data1,stringsAsFactors = F)

    name2 <- c("Mike","George","Marge","Bob")
    data2 <- c("klm","xxx","yyy","zzz")
    zip2 <- c(789,123,456,678)
    df2 <- data.frame(name2,data2,zip2,stringsAsFactors = F)

    data3 <- c("zzz", "ggg")
    name3 <- c("Bob","Mike")
    zip3 <- c(678,789)
    df3 <- data.frame(data3,name3,zip3,stringsAsFactors = F)

    name4<-c("Marge", "George","Bob")
    zip4<-c(234,123,678)
    data4<-c("ask","bff","hhh")
    df4 <- data.frame(name4,zip4,data4,stringsAsFactors = F)

I then decided to try the following code:

cross_checker2 <- function(dflist,criterialist){
  returnlist<-list()
looplen1 <- length(dflist)-1

 for(i in 1:looplen1){

    temp_df1 <- dflist[[i]]
    temp_crit1 <- criterialist[[i]]
    for(j in (i+1):length(dflist)){
     temp_df2 <- dflist[[j]]
 temp_crit2 <- criterialist[[j]]
   temp_df1 <- merge(temp_df1,temp_df2,by.x=temp_crit1,by.y=temp_crit2,all.x=TRUE)

    }

    returnlist[[length(returnlist)+1]]<-temp_df1
  }

I create the following lists to pass as arguments to the function:

deflista<-list()
deflista[[1]]<-df1
deflista[[2]]<-df2
deflista[[3]]<-df3
deflista[[4]]<-df4

crit1<-c(1,2)
crit2<-c(1,3)
crit3<-c(2,3)
crit4<-c(1,2)

critlist<-list()
critlist[[1]]<-crit1
critlist[[2]]<-crit2
critlist[[3]]<-crit3
critlist[[4]]<-crit4

and call it as:

test <- cross_checker2(deflista,critlist)

The output is correct for everything else besides the second dataframe. First dataframe is correct:

name1  |  zip1  | data1  | data2  | data3  | data4
-------|  ----- | -------|--------| -------|  -------
George |  123   | abc    | xxx    | <NA>   | bff
-------|  ------| -------| -------| -------| --------
Marge  | 456    | def    | yyy    | <NA>   | <NA>
------ | ------ | ------ | ------ | ------ | ------
Mike   | 789    | foo    | klm    | ggg    | <NA>

Now the second:

name2  | data2  | zip2   | data3  |  data4
------ | ------ | ------ | ------ | ------
Bob    | zzz    | 678    | zzz    | <NA>
------ | ------ | ------ | ------ | -------
George | xxx    | 123    | <NA>   | <NA>
-----  | ------ | ------ | ------ | ------
Marge  | yyy    | 456    | <NA>   | <NA>
-----  | ------ | ------ | ------ | ------
Mike   | klm    | 789    | ggg    | <NA>

Which is incorrect, since George and Bob in the last dataframe (deflista[[4]]) are there, but for some reason the merge doesn't return those.

Third dataframe:

name3  |  zip3  |  data3  |  data4
------ | ------ | ------- | ------
Bob    | 678    | zzz     | hhh
-----  | ------ | ------- | --------
Mike   | 789    | ggg     | <NA>

Which is correct, because Bob is found in the last dataframe (deflista[[4]])

I can't figure out what's wrong with the for loops, as there has to be some indexing issue when going into comparing the second dataframe in the bunch. Any ideas?

For these purposes I left out returning the row index of the found entry, but I can add it as soon as I can figure out what is wrong with it. Also, prefer any solution in base library.

Upvotes: 0

Views: 1339

Answers (2)

Gore
Gore

Reputation: 1

Thank you for the input!

And apologies, I think when i edited my original post, it removed some of the input I had received from you guys. I didn't know it would do that.

However, I managed a solution for this, one of the big ones was the merge, since I didn't realize it changes the column and row orders.

Anyhow, this works:

cross_checker4 <- function(dflist,criterialist) {
  # Initialize the output list

  returnlist <- list()


  # Initialize the outer loop length, 
  # this can be omitted in the for-loop below but let's 
  # keep it for historical reasons

  looplen1 <- length(dflist) - 1

  # Loop through all dataframes in dflist, 
  # this could just as well be for (i in 1:length(dflist)-1){}

  for (i in 1:looplen1) {
    # Initialize a temporary dataframe 
    # since we can't copy the data within dflist
    # Rearrange the columns for the output. 
    # Merge will mix them up otherwise

    temp_df1_drop <- dflist[[i]][-c(critlist[[i]])]
    temp_df1_keep <- dflist[[i]][c(critlist[[i]])]
    temp_df1 <- cbind(temp_df1_keep,temp_df1_drop)

    # Initialize the temporary criteria from criterialist

    temp_crit1 <- c(1:length(critlist[[i]]))

    # Loop through all remaining dataframes in dflist 
    # --> This is where we compare

    for (j in (i + 1):length(dflist)) {
      temp_df2 <- dflist[[j]]


      temp_df2_drop <- temp_df2[-c(critlist[[j]])]
      temp_df2_keep <- temp_df2[c(critlist[[j]])]
      temp_df2 <- cbind(temp_df2_keep,temp_df2_drop)

      # Add index column into dataframe to indicate 
      # which row the duplicate entry is on

      temp_df2$index <- 1:nrow(temp_df2)

      # Rename the index column

      indexer <- paste(c("index", j),collapse = " ")
      colnames(temp_df2)[colnames(temp_df2) == 'index'] <- indexer

      temp_crit2 <- c(1:length(critlist[[j]]))

      # Do the merge

      temp_df1 <-
        merge(
          temp_df1,temp_df2,by.x = temp_crit1,by.y = temp_crit2,all.x = TRUE
        )
    }

    # Insert merged dataframe into the returnlist

    returnlist[[length(returnlist) + 1]] <- temp_df1

  }
  # Since merge shoves in all columns in the comparison dataframe, 
  # we remove those columns and only leave index x

  for (k in 1:length(returnlist)) {
    for (o in (ncol(dflist[[k]]) + 1):(ncol(returnlist[[k]]))) {
      if (!grepl("index",names(returnlist[[k]])[o])) {
        returnlist[[k]] <- returnlist[[k]][,-o]
      }

    }
  }

  # Exit the loops and return the output list
  return(returnlist)
}

Upvotes: 0

Uwe
Uwe

Reputation: 42564

I believe, I have fixed the loops in the original question so far that they return the expected results:

# create lists
dflist <- list(df1, df2, df3)
criterialist <- list(c(1,2), c(1,3), c(2,3))

# add index columns
dflist <- lapply(dflist, function(x) {x[["index"]] <- seq_len(nrow(x)); x})

# find combinations of dataframes to check
combi <- combn(seq_along(dflist), 2)
combi
     [,1] [,2] [,3]
[1,]    1    1    2
[2,]    2    3    3
# check for matching rows
for (k in seq_len(ncol(combi))) {
  i <- combi[1, k]
  j <- combi[2, k]
  tmp <- merge(dflist[[i]], dflist[[j]], 
               by.x=criterialist[[i]], by.y=criterialist[[j]], all.x=TRUE)
  dflist[[i]][[paste0("row_df", j)]] <- tmp[order(tmp$index.x), "index.y"]
}
dflist
[[1]]
   name1 zip1 data1 index row_df2 row_df3
1 George  123   abc     1       2      NA
2  Marge  456   def     2       3      NA
3   Mike  789   foo     3       1       2

[[2]]
   name2 data2 zip2 index row_df3
1   Mike   klm  789     1       2
2 George   xxx  123     2      NA
3  Marge   yyy  456     3      NA
4    Bob   zzz  678     4       1

[[3]]
  data3 name3 zip3 index
1   zzz   Bob  678     1
2   ggg  Mike  789     2

Note that this is the expected result for checking 3 dataframes (before Edit2 of the question).

There are several flaws which caused the original code to break:

  1. The loop limit in the 2nd for loop was ill defined: for (i in 1:length(dflist)-1){. Here, the : operator takes precedence, so the indices started from 0 which caused errors. This can be fixed by an additional pair of parantheses for (i in 1:(length(dflist)-1)){ or even better by using the seq_len() function for (i in seq_len(length(dflist)-1)) {
  2. merge() returns two columns index.x and index.y. It only returns a single index column in case of merges with df1 where the OP has spared to add an index column.
  3. The result of merge() needs to be ordered by index.x before appending.
  4. The double for loop caused comparisons of dataframes with itself. Instead, the combn() function is used to find all unique combinations.

Upvotes: 0

Related Questions