Reputation: 5059
I have a list of many data.frames that I want to merge. The issue here is that each data.frame differs in terms of the number of rows and columns, but they all share the key variables (which I've called "var1"
and "var2"
in the code below). If the data.frames were identical in terms of columns, I could merely rbind
, for which plyr's rbind.fill would do the job, but that's not the case with these data.
Because the merge
command only works on 2 data.frames, I turned to the Internet for ideas. I got this one from here, which worked perfectly in R 2.7.2, which is what I had at the time:
merge.rec <- function(.list, ...){
if(length(.list)==1) return(.list[[1]])
Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}
And I would call the function like so:
df <- merge.rec(my.list, by.x = c("var1", "var2"),
by.y = c("var1", "var2"), all = T, suffixes=c("", ""))
But in any R version after 2.7.2, including 2.11 and 2.12, this code fails with the following error:
Error in match.names(clabs, names(xi)) :
names do not match previous names
(Incidently, I see other references to this error elsewhere with no resolution).
Is there any way to solve this?
Upvotes: 353
Views: 294131
Reputation: 11419
Another question asked specifically how to perform multiple left joins using dplyr in R . The question was marked as a duplicate of this one so I answer here, using the 3 sample data frames below:
x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE)
y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE)
z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE)
The answer is divided in three sections representing three different ways to perform the merge. You probably want to use the purrr
way if you are already using the tidyverse packages. For comparison purposes below, you'll find a base R version using the same sample dataset.
1) Join them with reduce
from the purrr
package:
The purrr
package provides a reduce
function which has a concise syntax:
library(tidyverse)
list(x, y, z) %>% reduce(left_join, by = "i")
# A tibble: 3 x 4
# i j k l
# <chr> <int> <int> <int>
# 1 a 1 NA 9
# 2 b 2 4 NA
# 3 c 3 5 7
You can also perform other joins, such as a full_join
or inner_join
:
list(x, y, z) %>% reduce(full_join, by = "i")
# A tibble: 4 x 4
# i j k l
# <chr> <int> <int> <int>
# 1 a 1 NA 9
# 2 b 2 4 NA
# 3 c 3 5 7
# 4 d NA 6 8
list(x, y, z) %>% reduce(inner_join, by = "i")
# A tibble: 1 x 4
# i j k l
# <chr> <int> <int> <int>
# 1 c 3 5 7
2) dplyr::left_join()
with base R Reduce()
:
list(x,y,z) %>%
Reduce(function(dtf1,dtf2) left_join(dtf1,dtf2,by="i"), .)
# i j k l
# 1 a 1 NA 9
# 2 b 2 4 NA
# 3 c 3 5 7
3) Base R merge()
with base R Reduce()
:
And for comparison purposes, here is a base R version of the left join based on Charles's answer.
Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),
list(x,y,z))
# i j k l
# 1 a 1 NA 9
# 2 b 2 4 NA
# 3 c 3 5 7
Upvotes: 337
Reputation: 47340
We can use {powerjoin}.
Borrowing sample data from accepted answer:
x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE)
y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE)
z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE)
library(powerjoin)
power_full_join(list(x,y,z), by = "i")
#> i j k l
#> 1 a 1 NA 9
#> 2 b 2 4 NA
#> 3 c 3 5 7
#> 4 d NA 6 8
power_left_join(list(x,y,z), by = "i")
#> i j k l
#> 1 a 1 NA 9
#> 2 b 2 4 NA
#> 3 c 3 5 7
You might also start with a dataframe and join a list of data frames, for the same result
power_full_join(x, list(y,z), by = "i")
#> i j k l
#> 1 a 1 NA 9
#> 2 b 2 4 NA
#> 3 c 3 5 7
#> 4 d NA 6 8
Upvotes: 8
Reputation: 1663
Here is a generic wrapper which can be used to convert a binary function to multi-parameters function. The benefit of this solution is that it is very generic and can be applied to any binary functions. You just need to do it once and then you can apply it any where.
To demo the idea, I use simple recursion to implement. It can be of course implemented with more elegant way that benefits from R's good support for functional paradigm.
fold_left <- function(f) {
return(function(...) {
args <- list(...)
return(function(...){
iter <- function(result,rest) {
if (length(rest) == 0) {
return(result)
} else {
return(iter(f(result, rest[[1]], ...), rest[-1]))
}
}
return(iter(args[[1]], args[-1]))
})
})}
Then you can simply wrap any binary functions with it and call with positional parameters (usually data.frames) in the first parentheses and named parameters in the second parentheses (such as by =
or suffix =
). If no named parameters, leave second parentheses empty.
merge_all <- fold_left(merge)
merge_all(df1, df2, df3, df4, df5)(by.x = c("var1", "var2"), by.y = c("var1", "var2"))
left_join_all <- fold_left(left_join)
left_join_all(df1, df2, df3, df4, df5)(c("var1", "var2"))
left_join_all(df1, df2, df3, df4, df5)()
Upvotes: 1
Reputation: 946
When you have a list of dfs, and a column contains the "ID", but in some lists, some IDs are missing, then you may use this version of Reduce / Merge in order to join multiple Dfs of missing Row Ids or labels:
Reduce(function(x, y) merge(x=x, y=y, by="V1", all.x=T, all.y=T), list_of_dfs)
Upvotes: 0
Reputation: 946
I had a list of dataframes with no common id column.
I had missing data on many dfs. There were Null values.
The dataframes were produced using table function.
The Reduce, Merging, rbind, rbind.fill, and their like could not help me to my aim.
My aim was to produce an understandable merged dataframe, irrelevant of the missing data and common id column.
Therefore, I made the following function. Maybe this function can help someone.
##########################################################
#### Dependencies #####
##########################################################
# Depends on Base R only
##########################################################
#### Example DF #####
##########################################################
# Example df
ex_df <- cbind(c( seq(1, 10, 1), rep("NA", 0), seq(1,10, 1) ),
c( seq(1, 7, 1), rep("NA", 3), seq(1, 12, 1) ),
c( seq(1, 3, 1), rep("NA", 7), seq(1, 5, 1), rep("NA", 5) ))
# Making colnames and rownames
colnames(ex_df) <- 1:dim(ex_df)[2]
rownames(ex_df) <- 1:dim(ex_df)[1]
# Making an unequal list of dfs,
# without a common id column
list_of_df <- apply(ex_df=="NA", 2, ( table) )
it is following the function
##########################################################
#### The function #####
##########################################################
# The function to rbind it
rbind_null_df_lists <- function ( list_of_dfs ) {
length_df <- do.call(rbind, (lapply( list_of_dfs, function(x) length(x))))
max_no <- max(length_df[,1])
max_df <- length_df[max(length_df),]
name_df <- names(length_df[length_df== max_no,][1])
names_list <- names(list_of_dfs[ name_df][[1]])
df_dfs <- list()
for (i in 1:max_no ) {
df_dfs[[i]] <- do.call(rbind, lapply(1:length(list_of_dfs), function(x) list_of_dfs[[x]][i]))
}
df_cbind <- do.call( cbind, df_dfs )
rownames( df_cbind ) <- rownames (length_df)
colnames( df_cbind ) <- names_list
df_cbind
}
Running the example
##########################################################
#### Running the example #####
##########################################################
rbind_null_df_lists ( list_of_df )
Upvotes: 1
Reputation: 3055
I will reuse the data example from @PaulRougieux
x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)
Here's a short and sweet solution using purrr
and tidyr
library(tidyverse)
list(x, y, z) %>%
map_df(gather, key=key, value=value, -i) %>%
spread(key, value)
Upvotes: 5
Reputation: 4469
Reduce makes this fairly easy:
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
Here's a fully example using some mock data:
set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
# x a b y
#12 12 NA 18 NA
#13 13 NA 19 NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352
And here's an example using these data to replicate my.list
:
merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]
# matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1 ALGIERE 200 RI 026 S NA <NA> NA NA NA NA <NA>
#2 ALVES 100 RI 019 S NA <NA> NA NA NA NA <NA>
#3 BADEAU 100 RI 032 S NA <NA> NA NA NA NA <NA>
Note: It looks like this is arguably a bug in merge
. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame
which does make.unique
the names, causing the rbind
to fail.
# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname party st district chamber senate1993 name.x
# [8] votes.year.x senate1994 name.y votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname party st district chamber senate1993 name.x
# [8] votes.year.x senate1994 name.y votes.year.y senate1995 name votes.year
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.
Easiest way to fix is to not leave the field renaming for duplicates fields (of which there are many here) up to merge
. Eg:
my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))
The merge
/Reduce
will then work fine.
Upvotes: 245
Reputation: 55735
You can do it using merge_all
in the reshape
package. You can pass parameters to merge
using the ...
argument
reshape::merge_all(list_of_dataframes, ...)
Here is an excellent resource on different methods to merge data frames.
Upvotes: 60
Reputation: 36597
You can use recursion to do this. I haven't verified the following, but it should give you the right idea:
MergeListOfDf = function( data , ... )
{
if ( length( data ) == 2 )
{
return( merge( data[[ 1 ]] , data[[ 2 ]] , ... ) )
}
return( merge( MergeListOfDf( data[ -1 ] , ... ) , data[[ 1 ]] , ... ) )
}
Upvotes: 6