Reputation: 1
I'm trying to write a function that takes as its arguments a list of dataframes and a list of criteria, then returns a list of those dataframes with columns that indicate the row on which those values are duplicated in another dataframe.
For example, I have three dataframes:
df1:
Name1 | Zip_code | Data
----- | -------- | ----
George| 123 | abc
----- | -------- | ----
Marge | 456 | def
----- | -------- | ----
Mike | 789 | foo
df2:
Name | data | zip_code
----- | -------- | --------
Mike | klm | 789
----- | -------- | --------
George| xxx | 123
----- | -------- | --------
Marge | yyy | 456
----- | -------- | --------
Bob | zzz | 678
df3:
Data | Name | zip_code
----- | -------- | --------
zzz | Bob | 678
----- | -------- | --------
ggg | Mike | 789
Assuming I only care about which names and zip-codes are duplicated, I want the output to look like this:
df1:
Name1 | Zip_code | Data | row_df2 | row_df3
----- | --------- | ---- | ------- | -------
George| 123 | abc | 2 | NA
----- | --------- | ---- | ------- | -------
Marge | 456 | def | 3 | NA
----- | --------- | ---- | ------- | -------
Mike | 789 | foo | 1 | 2
df2:
Name | data | zip_code | row_df3
----- | ----- | --------- | -------
Mike | klm | 789 | 2
----- | ----- | --------- | -------
George| xxx | 123 | NA
----- | ----- | --------- | -------
Marge | yyy | 456 | NA
----- | ----- | --------- | -------
Bob | zzz | 678 | 1
The column names will not always be the same between each dataframe, e.g. we can have "Name" in one dataframe and "NameWhole" in another. Additionally, there can be differing amounts of columns in each dataframe. I realized that the order of the data that is to be compared needs to be the same from left to right per dataframe, but otherwise what is between columns does not matter. Thus,
df1 has:
df2 has:
df3 has:
My current solution is as follows:
First, initialize the dataframe list which is the first argument of the function:
dflist[[1]] <- df1
dflist[[2]] <- df2
dflist[[3]] <- df3
Then we initialize the the criteria list which is the second argument of the function. Since we're interested in what Names and zip_codes are common across the dataframes, this is:
criterialist[[1]] <- c(1,2)
criterialist[[2]] <- c(1,3)
criterialist[[3]] <- c(2,3)
Now the function is:
cross_checker <- function(dflist, criterialist){
# Insert an index column indicating the row number to be returned:
for (i in 2:length(dflist)){
dflist[[i]]$index <- 1:nrow(dflist[[i]])
}
# Next we loop over the dataframes with two for-loops:
for (i in 1:length(dflist)-1){
for (j in 2:length(dflist)){
dflist[[i]][,ncol(dflist[[i]])+1] <- merge(dflist[[i]], dflist[[j]], by.x=criterialist[[i]], by.y=criterialist[[j]], all.x=TRUE)$index
}
}
As a result I've only got one new index column into df1, and sometimes my RStudio just opens up a debug window. I'm not sure if "merge" is the way to go with this, but I haven't managed to figure out how "match" would work either.
I suppose one way would be to brute-force it with for-loops, but I think that would be very slow.
The end idea is to create a function that takes an arbitrary amount of dataframes with an arbitrary criteria by which to check for duplicated records and return those dataframes with a new column that indicates on which row and in which dataframe the record was duplicated.
Edit: Apologies, my first question. Here is reproducible code for the tables:
name1 <- c("George","Marge","Mike")
zip1 <- c(123,456,789)
data1 <- c("abc","def","foo")
df1 <- data.frame(name1,zip1,data1,stringsAsFactors = F)
name2 <- c("Mike","George","Marge","Bob")
data2 <- c("klm","xxx","yyy","zzz")
zip2 <- c(789,123,456,678)
df2 <- data.frame(name2,data2,zip2,stringsAsFactors = F)
data3 <- c("zzz", "ggg")
name3 <- c("Bob","Mike")
zip3 <- c(678,789)
df3 <- data.frame(data3,name3,zip3,stringsAsFactors = F)
Edit 2:
I decided to add one additional dataframe (so now there are 4):
name1 <- c("George","Marge","Mike")
zip1 <- c(123,456,789)
data1 <- c("abc","def","foo")
df1 <- data.frame(name1,zip1,data1,stringsAsFactors = F)
name2 <- c("Mike","George","Marge","Bob")
data2 <- c("klm","xxx","yyy","zzz")
zip2 <- c(789,123,456,678)
df2 <- data.frame(name2,data2,zip2,stringsAsFactors = F)
data3 <- c("zzz", "ggg")
name3 <- c("Bob","Mike")
zip3 <- c(678,789)
df3 <- data.frame(data3,name3,zip3,stringsAsFactors = F)
name4<-c("Marge", "George","Bob")
zip4<-c(234,123,678)
data4<-c("ask","bff","hhh")
df4 <- data.frame(name4,zip4,data4,stringsAsFactors = F)
I then decided to try the following code:
cross_checker2 <- function(dflist,criterialist){
returnlist<-list()
looplen1 <- length(dflist)-1
for(i in 1:looplen1){
temp_df1 <- dflist[[i]]
temp_crit1 <- criterialist[[i]]
for(j in (i+1):length(dflist)){
temp_df2 <- dflist[[j]]
temp_crit2 <- criterialist[[j]]
temp_df1 <- merge(temp_df1,temp_df2,by.x=temp_crit1,by.y=temp_crit2,all.x=TRUE)
}
returnlist[[length(returnlist)+1]]<-temp_df1
}
I create the following lists to pass as arguments to the function:
deflista<-list()
deflista[[1]]<-df1
deflista[[2]]<-df2
deflista[[3]]<-df3
deflista[[4]]<-df4
crit1<-c(1,2)
crit2<-c(1,3)
crit3<-c(2,3)
crit4<-c(1,2)
critlist<-list()
critlist[[1]]<-crit1
critlist[[2]]<-crit2
critlist[[3]]<-crit3
critlist[[4]]<-crit4
and call it as:
test <- cross_checker2(deflista,critlist)
The output is correct for everything else besides the second dataframe. First dataframe is correct:
name1 | zip1 | data1 | data2 | data3 | data4
-------| ----- | -------|--------| -------| -------
George | 123 | abc | xxx | <NA> | bff
-------| ------| -------| -------| -------| --------
Marge | 456 | def | yyy | <NA> | <NA>
------ | ------ | ------ | ------ | ------ | ------
Mike | 789 | foo | klm | ggg | <NA>
Now the second:
name2 | data2 | zip2 | data3 | data4
------ | ------ | ------ | ------ | ------
Bob | zzz | 678 | zzz | <NA>
------ | ------ | ------ | ------ | -------
George | xxx | 123 | <NA> | <NA>
----- | ------ | ------ | ------ | ------
Marge | yyy | 456 | <NA> | <NA>
----- | ------ | ------ | ------ | ------
Mike | klm | 789 | ggg | <NA>
Which is incorrect, since George and Bob in the last dataframe (deflista[[4]]) are there, but for some reason the merge doesn't return those.
Third dataframe:
name3 | zip3 | data3 | data4
------ | ------ | ------- | ------
Bob | 678 | zzz | hhh
----- | ------ | ------- | --------
Mike | 789 | ggg | <NA>
Which is correct, because Bob is found in the last dataframe (deflista[[4]])
I can't figure out what's wrong with the for loops, as there has to be some indexing issue when going into comparing the second dataframe in the bunch. Any ideas?
For these purposes I left out returning the row index of the found entry, but I can add it as soon as I can figure out what is wrong with it. Also, prefer any solution in base library.
Upvotes: 0
Views: 1339
Reputation: 1
Thank you for the input!
And apologies, I think when i edited my original post, it removed some of the input I had received from you guys. I didn't know it would do that.
However, I managed a solution for this, one of the big ones was the merge, since I didn't realize it changes the column and row orders.
Anyhow, this works:
cross_checker4 <- function(dflist,criterialist) {
# Initialize the output list
returnlist <- list()
# Initialize the outer loop length,
# this can be omitted in the for-loop below but let's
# keep it for historical reasons
looplen1 <- length(dflist) - 1
# Loop through all dataframes in dflist,
# this could just as well be for (i in 1:length(dflist)-1){}
for (i in 1:looplen1) {
# Initialize a temporary dataframe
# since we can't copy the data within dflist
# Rearrange the columns for the output.
# Merge will mix them up otherwise
temp_df1_drop <- dflist[[i]][-c(critlist[[i]])]
temp_df1_keep <- dflist[[i]][c(critlist[[i]])]
temp_df1 <- cbind(temp_df1_keep,temp_df1_drop)
# Initialize the temporary criteria from criterialist
temp_crit1 <- c(1:length(critlist[[i]]))
# Loop through all remaining dataframes in dflist
# --> This is where we compare
for (j in (i + 1):length(dflist)) {
temp_df2 <- dflist[[j]]
temp_df2_drop <- temp_df2[-c(critlist[[j]])]
temp_df2_keep <- temp_df2[c(critlist[[j]])]
temp_df2 <- cbind(temp_df2_keep,temp_df2_drop)
# Add index column into dataframe to indicate
# which row the duplicate entry is on
temp_df2$index <- 1:nrow(temp_df2)
# Rename the index column
indexer <- paste(c("index", j),collapse = " ")
colnames(temp_df2)[colnames(temp_df2) == 'index'] <- indexer
temp_crit2 <- c(1:length(critlist[[j]]))
# Do the merge
temp_df1 <-
merge(
temp_df1,temp_df2,by.x = temp_crit1,by.y = temp_crit2,all.x = TRUE
)
}
# Insert merged dataframe into the returnlist
returnlist[[length(returnlist) + 1]] <- temp_df1
}
# Since merge shoves in all columns in the comparison dataframe,
# we remove those columns and only leave index x
for (k in 1:length(returnlist)) {
for (o in (ncol(dflist[[k]]) + 1):(ncol(returnlist[[k]]))) {
if (!grepl("index",names(returnlist[[k]])[o])) {
returnlist[[k]] <- returnlist[[k]][,-o]
}
}
}
# Exit the loops and return the output list
return(returnlist)
}
Upvotes: 0
Reputation: 42564
I believe, I have fixed the loops in the original question so far that they return the expected results:
# create lists
dflist <- list(df1, df2, df3)
criterialist <- list(c(1,2), c(1,3), c(2,3))
# add index columns
dflist <- lapply(dflist, function(x) {x[["index"]] <- seq_len(nrow(x)); x})
# find combinations of dataframes to check
combi <- combn(seq_along(dflist), 2)
combi
[,1] [,2] [,3] [1,] 1 1 2 [2,] 2 3 3
# check for matching rows
for (k in seq_len(ncol(combi))) {
i <- combi[1, k]
j <- combi[2, k]
tmp <- merge(dflist[[i]], dflist[[j]],
by.x=criterialist[[i]], by.y=criterialist[[j]], all.x=TRUE)
dflist[[i]][[paste0("row_df", j)]] <- tmp[order(tmp$index.x), "index.y"]
}
dflist
[[1]] name1 zip1 data1 index row_df2 row_df3 1 George 123 abc 1 2 NA 2 Marge 456 def 2 3 NA 3 Mike 789 foo 3 1 2 [[2]] name2 data2 zip2 index row_df3 1 Mike klm 789 1 2 2 George xxx 123 2 NA 3 Marge yyy 456 3 NA 4 Bob zzz 678 4 1 [[3]] data3 name3 zip3 index 1 zzz Bob 678 1 2 ggg Mike 789 2
Note that this is the expected result for checking 3 dataframes (before Edit2 of the question).
There are several flaws which caused the original code to break:
for
loop was ill defined: for (i in 1:length(dflist)-1){
. Here, the :
operator takes precedence, so the indices started from 0
which caused errors. This can be fixed by an additional pair of parantheses for (i in 1:(length(dflist)-1)){
or even better by using the seq_len()
function for (i in seq_len(length(dflist)-1)) {
merge()
returns two columns index.x
and index.y
. It only returns a single index
column in case of merges with df1
where the OP has spared to add an index column.merge()
needs to be ordered by index.x
before appending.for
loop caused comparisons of dataframes with itself. Instead, the combn()
function is used to find all unique combinations. Upvotes: 0