Ale
Ale

Reputation: 317

Merge dataframes with some common columns and fill in others with NAs

I have around 50+ csv files that all share the same 4 columns in this order:

REG_ID  region  age age_num

and then years anything from 1990 till 2016 in this format:

REG_ID  region  age age_num y_1992 y_1993 y_1994 y_2014.15

and I was wondering what could be the best way to merge them. Going thru each to add the missing years-columns would be time consuming and likely lead to errors. The end format would be something like this:

REG_ID       region        reg_num   age   age_num  y_1991 y_1992 y_1993
BFM2         Boucle        1       c_0_4    0       770    NA     120
BFM2         Boucle        1       c_5_9    5       810    NA     11 
BFM2         Boucle        1     c_10_14   10       704    NA     130
BFM2         Boucle        1     c_15_19   15       71     NA     512
BFM2         Boucle        1     c_20_24   20       181    NA     712

Upvotes: 1

Views: 83

Answers (2)

austensen
austensen

Reputation: 3017

Here's a way you can do it using tidyverse tools. First use dir to get a vector of csv paths, then use purrr:map to read them all in, returning a list of the data frames, and then use purrr::reduce to merge all the data frames using dplyr::left_join.


library(readr)
library(purrr)
library(dplyr)

create the data sets

read_csv(
"REG_ID,region,reg_num,age,age_num,y_1991
BFM2,Boucle,1,c_0_4,0,770
BFM2,Boucle,1,c_5_9,5,810
BFM2,Boucle,1,c_10_14,10,704
BFM2,Boucle,1,c_15_19,15,71
BFM2,Boucle,1,c_20_24,20,181") %>% 
  write_csv("df_91.csv")

read_csv(
"REG_ID,region,reg_num,age,age_num,y_1992
BFM2,Boucle,1,c_0_4,0,NA
BFM2,Boucle,1,c_5_9,5,NA
BFM2,Boucle,1,c_10_14,10,NA
BFM2,Boucle,1,c_15_19,15,NA
BFM2,Boucle,1,c_20_24,20,NA") %>% 
  write_csv("df_92.csv")

read_csv(
"REG_ID,region,reg_num,age,age_num,y_1993
BFM2,Boucle,1,c_0_4,0,120
BFM2,Boucle,1,c_5_9,5,11 
BFM2,Boucle,1,c_10_14,10,130
BFM2,Boucle,1,c_15_19,15,512
BFM2,Boucle,1,c_20_24,20,712") %>% 
  write_csv("df_93.csv")

Create the final merged data set

dir(".", "\\.csv", full.names = TRUE) %>% 
  map(read_csv) %>% 
  reduce(left_join, by = c("REG_ID", "region", "reg_num", "age", "age_num"))

#> # A tibble: 5 x 8
#>   REG_ID region reg_num     age age_num y_1991 y_1992 y_1993
#>    <chr>  <chr>   <int>   <chr>   <int>  <int>  <chr>  <int>
#> 1   BFM2 Boucle       1   c_0_4       0    770   <NA>    120
#> 2   BFM2 Boucle       1   c_5_9       5    810   <NA>     11
#> 3   BFM2 Boucle       1 c_10_14      10    704   <NA>    130
#> 4   BFM2 Boucle       1 c_15_19      15     71   <NA>    512
#> 5   BFM2 Boucle       1 c_20_24      20    181   <NA>    712

Upvotes: 2

Vitalijs
Vitalijs

Reputation: 950

I think the best way would be:

library(data.table)
library(stringr)
data<-list("vector")
files_to_loop<-list.vector()[str_detect(list.vector(),".csv")]
for (i in 1:length(files_to_loop)){
  data[[i]]<-fread(files_to_loop[i])
}
data<-rbindlist(data,use.names=TRUE,fill=TRUE)

Upvotes: 0

Related Questions