scottsmith
scottsmith

Reputation: 401

Merge many pairs of data together into separate data sets

I have about 100 dataset pairs that need to be merged into single data sets, I have looked at posts that show how to merge many data sets together at once (e.g., here and here), but my issue is unique. My real-world data are stored on my hard drive and are similarly named (e.g., household2010, household2011, household2012 and person2010, person2011, person2012. They don't need to be loaded into the global environment.Example below.

Working data:

library(tidyverse)

household2010 <- tribble(
~id, ~var2, ~var3, ~var4, ~var5,
"1",   "1",   "1",   "a",   "d",
"2",   "2",   "2",   "b",   "e",
"3",   "3",   "3",   "c",   "f"
)

person2010 <- tribble(
~id, ~var6, ~var7,
"1",   "1",   "1",
"2",   "2",   "2",
"3",   "3",   "3",
"4",   "4",   "4"
)

household2011 <- tribble(
~id, ~var8, ~var9, ~var10,
"1",   "1",   "1",    "1",
"2",   "2",   "2",    "2",
"3",   "3",   "3",    "3",
"4",   "4",   "4",    "4"
)

person2011 <- tribble(
~id, ~var11, ~var12, ~var13,
"1",   "1",   "1",    "1",
"2",   "2",   "2",    "2",
"3",   "3",   "3",    "3",
"4",   "4",   "4",    "4",
"5",   "5",   "5",    "5"
 )

I need to merge household2010 with person2010 and create a new dataset called hhperson2010. I need to do this to household2011 and person2011 too. Individually I could do:

hhperson2010 <- left_join(household2010, person2010, by = "id")

hhperson2011 <- left_join(household2011, person2011, by = "id")

This gets clunky when I have over a 100 data pairs. Can I use lapply to have it go through a list of the data sets and merge? Something like:

dflist1 <- list(household2010, household2011)

dflist2 <- list(person2011,    person2011)

lapply(function(x) left_join(dflist, dflist2, by = "id")

Upvotes: 0

Views: 130

Answers (3)

Vishal786btc
Vishal786btc

Reputation: 438

just an alternate solution:
years <- c("2010", "2011", "2014") 

for (x in years){
  result <- merge(get(paste0("household", x)), get(paste0("person", x)), "id")
  names <- paste0("household", x)
  print(names)
  print(result)
}

you can make a choice between loop or lapply depending on your further processing if any. In case, you don't have any more to do with your dataset, I think lapply would just solve the purpose.

Upvotes: 0

dmi3kno
dmi3kno

Reputation: 3055

Here's how I would do it with tidyverse and list-columns

library(dplyr)
library(tidyr)
library(purrr)

env2listcol <- function(rdata_file)
{
  e <- new.env()
  load(rdata_file, envir = e)
  # since you know that there's only 1 df in each environment
  as.list.environment(e)[[1]] 
}

# assuming files are stored in `input` folder
dir("input", full.names = T) %>% as_tibble() %>% 
  # split the path
  separate(value, into=c("dir", "file", "ext"), remove=FALSE) %>% 
  # get the category and the key in separate columns
  extract(file, into=c("key", "year"), regex="([a-z]+)(\\d+)") %>% 
  # file path by category by year, remove unnecessary columns
  spread(key, value) %>% select(-dir, -ext) %>% 
  # extract dataframes from environments, and join them
  mutate(household=map(household, env2listcol),
         person=map(person, env2listcol),
         joined=map2(household, person, left_join)) %>% 
  # rbind joined tables, although you could pull(joined) or imap over it
  unnest(joined)

#> # A tibble: 7 x 14
#>    year    id  var2  var3  var4  var5  var6  var7  var8  var9 var10 var11 var12 var13
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1  2010     1     1     1     a     d     1     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 2  2010     2     2     2     b     e     2     2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 3  2010     3     3     3     c     f     3     3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 4  2011     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     1     1     1     1     1     1
#> 5  2011     2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     2     2     2     2     2     2
#> 6  2011     3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     3     3     3     3     3     3
#> 7  2011     4  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     4     4     4     4     4     4

You decide what you want to do with it. You may write it back to R objects (please, please, please... use Rds instead). You may write it back into a single table (I believe it is MUCH easier to handle going forward). You can probably even export it as json.

Upvotes: 0

pogibas
pogibas

Reputation: 28369

Maybe something like this:

years <- 2010:2011
result <- lapply(years, 
              function(x) left_join(get(paste0("household", x)), 
                                    get(paste0("person", x)), 
                                    "id"))

names(result) <- paste0("household", years)

Upvotes: 1

Related Questions