Reputation: 369
I have multiple dataframes with varying column numbers and many thousands of rows. Every DF has Year and ISO columns. I want to merge all of them into a final DF that has a single Year and ISO columns and unique value column/s that correspond to those from the original DFs. The final output has to go back out in .xlsx format so I'd like to minimize the number of rows which are populated with mostly NA.
Here is a reproducible example:
library(ISOcodes)
df1 = data.frame(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value1 = sample(1:100, 10))
df2 = data.frame(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value2 = sample(1:100, 10))
df3 = data.frame(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value3 = sample(1:100, 10))
df4 = data.frame(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value4 = sample(1:100, 10))
df5 = data.frame(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value5 = sample(1:100, 10))
full_df = merge(df1, df2, by = c('Year', 'ISO'), all = T)
full_df = merge(full_df, df3, by = c('Year', 'ISO'), all = T)
full_df = merge(full_df, df4, by = c('Year', 'ISO'), all = T)
full_df = merge(full_df, df5, by = c('Year', 'ISO'), all = T)
I have to specify all = T
so that I don't lose data. This solution feels very clunky and I'm sure there must be more elegant ways to do it (possibly with data.table
?).
Thanks!
Upvotes: 4
Views: 11401
Reputation: 5408
Edit: Note - you can test the minimizing of NA
by forcing df5 to have the same Year and ISO values as df4. Then you would expect the final output to have a dim()
of ~ 40x7 rather than 50x7 (depending on sample).
df5$Year <- df4$Year
df5$ISO <- df4$ISO
Nested dplyr::full_join()
will minimize the NA
library(dplyr)
test_join <- full_join(df1, full_join(df2, full_join(df3, full_join(df4, df5))))
Not minimizing NA
- dplyr::bind_rows()
"Efficiently bind multiple data frames by row".
library(dplyr)
test_df <- bind_rows(df1, df2,df3,df4,df5)
Clunky, but will minimize NA
- data.table::merge()
a method for "Fast merge of two data.tables".
library(data.table)
dt1 = data.table(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value1 = sample(1:100, 10))
dt2 = data.table(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value2 = sample(1:100, 10))
dt3 = data.table(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value3 = sample(1:100, 10))
dt4 = data.table(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value4 = sample(1:100, 10))
dt5 = data.table(Year = sample(2000:2020, 10),
ISO = sample(ISO_3166_1$Alpha_2, 10),
value5 = sample(1:100, 10))
full_dt = merge(dt1, dt2, by = c('Year', 'ISO'), all = T)
full_dt = merge(full_dt, dt3, by = c('Year', 'ISO'), all = T)
full_dt = merge(full_dt, dt4, by = c('Year', 'ISO'), all = T)
full_dt = merge(full_dt, dt5, by = c('Year', 'ISO'), all = T)
Upvotes: 2
Reputation: 8880
l <- list(df1, df2, df3, df4, df5)
purrr::reduce(.x = l, merge, by = c('Year', 'ISO'), all = T)
Year ISO value1 value2 value3 value4 value5
1 2000 KZ NA 75 NA NA NA
2 2000 TF NA NA NA NA 34
3 2001 AD NA NA NA NA 31
4 2001 ML NA NA 87 NA NA
5 2001 NF 8 NA NA NA NA
6 2002 CC NA NA NA 12 NA
7 2002 NF NA 63 NA NA NA
8 2002 SD 42 NA NA NA NA
9 2002 SY NA NA NA NA 45
10 2003 AW NA NA 41 NA NA
11 2003 BR NA NA NA 42 NA
12 2003 TT 96 NA NA NA NA
13 2004 KE NA 100 NA NA NA
14 2005 CG NA NA NA NA 67
15 2006 BV NA NA 67 NA NA
16 2006 BW NA 9 NA NA NA
17 2006 GU 18 NA NA NA NA
18 2007 CG NA NA NA 81 NA
19 2007 IM NA 18 NA NA NA
20 2008 BH NA NA 100 NA NA
21 2008 MD 28 NA NA NA NA
22 2008 PY NA NA NA NA 96
23 2008 TR NA NA NA 87 NA
24 2009 SM NA NA NA 53 NA
25 2010 GF NA NA 56 NA NA
26 2010 LU NA NA NA 43 NA
27 2010 PM NA 80 NA NA NA
28 2011 MF NA NA 38 NA NA
29 2012 CF NA 56 NA NA NA
30 2012 JO NA NA NA 16 NA
31 2012 UG NA NA 63 NA NA
32 2013 BY 68 NA NA NA NA
33 2013 MT NA 13 NA NA NA
34 2013 NO NA NA NA 74 NA
35 2014 TR NA NA NA NA 98
36 2015 GM NA NA NA NA 27
37 2015 SC 71 NA NA NA NA
38 2016 MM NA NA 65 NA NA
39 2017 AX 50 NA NA NA NA
40 2017 BA NA NA 8 NA NA
41 2017 MF NA NA NA NA 20
42 2017 SH NA NA NA 6 NA
43 2018 GU NA 46 NA NA NA
44 2018 LR NA NA NA NA 56
45 2018 MK 33 NA NA NA NA
46 2018 SN NA NA 82 NA NA
47 2019 NA 26 NA NA NA NA
48 2019 TW NA NA NA 51 NA
49 2020 AQ NA 28 NA NA NA
50 2020 FO NA NA NA NA 81
Upvotes: 3
Reputation: 11548
Does this work:
mylist <- mget(paste0('df',1:5))
Reduce(function(x, y) merge(x,y, all= TRUE), mylist)
Year ISO value1 value2 value3 value4 value5
1 2000 PK NA NA NA 56 NA
2 2000 SX 28 NA NA NA NA
3 2001 AZ NA NA NA 45 NA
4 2001 BY NA NA NA NA 93
5 2001 GQ NA 68 NA NA NA
6 2001 KW NA NA 8 NA NA
7 2002 AL 17 NA NA NA NA
8 2002 LV NA 93 NA NA NA
9 2002 TK NA NA NA NA 60
10 2003 MP 12 NA NA NA NA
11 2003 NC NA NA NA 68 NA
12 2004 AF 90 NA NA NA NA
13 2004 AQ NA 31 NA NA NA
14 2004 PG NA NA NA NA 45
15 2005 BH NA 34 NA NA NA
16 2005 DZ NA NA NA NA 83
17 2006 PS NA NA NA 98 NA
18 2006 TK 98 NA NA NA NA
19 2006 TL NA NA 16 NA NA
20 2007 HT 33 NA NA NA NA
21 2007 PN NA NA NA NA 70
22 2007 VE NA 60 NA NA NA
23 2008 BO NA NA NA 52 NA
24 2008 TW NA NA NA NA 64
25 2009 LR 30 NA NA NA NA
26 2009 MH NA NA NA 60 NA
27 2009 TR NA NA 15 NA NA
28 2010 FO NA 12 NA NA NA
29 2010 LY NA NA NA 11 NA
30 2010 ML 89 NA NA NA NA
31 2011 GW NA NA 40 NA NA
32 2012 BY NA NA 83 NA NA
33 2012 HK NA NA NA NA 68
34 2012 PM NA NA NA 72 NA
35 2014 ET NA NA NA NA 24
36 2014 MP NA NA 64 NA NA
37 2014 WS NA 20 NA NA NA
38 2016 GT NA NA NA NA 51
39 2016 JM NA 97 NA NA NA
40 2016 UY 42 NA NA NA NA
41 2017 CK NA NA NA 64 NA
42 2017 WF NA NA 76 NA NA
43 2018 FK NA NA NA 97 NA
44 2018 PK NA NA 26 NA NA
45 2019 AM NA 41 NA NA NA
46 2019 CZ 75 NA NA NA NA
47 2019 LS NA NA NA NA 46
48 2019 TV NA NA 10 NA NA
49 2020 FM NA NA 79 NA NA
50 2020 MR NA 65 NA NA NA
Upvotes: 2