Reputation: 317
I have around 50+ csv files that all share the same 4 columns in this order:
REG_ID region age age_num
and then years anything from 1990 till 2016 in this format:
REG_ID region age age_num y_1992 y_1993 y_1994 y_2014.15
and I was wondering what could be the best way to merge them. Going thru each to add the missing years-columns would be time consuming and likely lead to errors. The end format would be something like this:
REG_ID region reg_num age age_num y_1991 y_1992 y_1993
BFM2 Boucle 1 c_0_4 0 770 NA 120
BFM2 Boucle 1 c_5_9 5 810 NA 11
BFM2 Boucle 1 c_10_14 10 704 NA 130
BFM2 Boucle 1 c_15_19 15 71 NA 512
BFM2 Boucle 1 c_20_24 20 181 NA 712
Upvotes: 1
Views: 83
Reputation: 3017
Here's a way you can do it using tidyverse tools. First use dir
to get a vector of csv
paths, then use purrr:map
to read them all in, returning a list of the data frames, and then use purrr::reduce
to merge all the data frames using dplyr::left_join
.
library(readr)
library(purrr)
library(dplyr)
read_csv(
"REG_ID,region,reg_num,age,age_num,y_1991
BFM2,Boucle,1,c_0_4,0,770
BFM2,Boucle,1,c_5_9,5,810
BFM2,Boucle,1,c_10_14,10,704
BFM2,Boucle,1,c_15_19,15,71
BFM2,Boucle,1,c_20_24,20,181") %>%
write_csv("df_91.csv")
read_csv(
"REG_ID,region,reg_num,age,age_num,y_1992
BFM2,Boucle,1,c_0_4,0,NA
BFM2,Boucle,1,c_5_9,5,NA
BFM2,Boucle,1,c_10_14,10,NA
BFM2,Boucle,1,c_15_19,15,NA
BFM2,Boucle,1,c_20_24,20,NA") %>%
write_csv("df_92.csv")
read_csv(
"REG_ID,region,reg_num,age,age_num,y_1993
BFM2,Boucle,1,c_0_4,0,120
BFM2,Boucle,1,c_5_9,5,11
BFM2,Boucle,1,c_10_14,10,130
BFM2,Boucle,1,c_15_19,15,512
BFM2,Boucle,1,c_20_24,20,712") %>%
write_csv("df_93.csv")
dir(".", "\\.csv", full.names = TRUE) %>%
map(read_csv) %>%
reduce(left_join, by = c("REG_ID", "region", "reg_num", "age", "age_num"))
#> # A tibble: 5 x 8
#> REG_ID region reg_num age age_num y_1991 y_1992 y_1993
#> <chr> <chr> <int> <chr> <int> <int> <chr> <int>
#> 1 BFM2 Boucle 1 c_0_4 0 770 <NA> 120
#> 2 BFM2 Boucle 1 c_5_9 5 810 <NA> 11
#> 3 BFM2 Boucle 1 c_10_14 10 704 <NA> 130
#> 4 BFM2 Boucle 1 c_15_19 15 71 <NA> 512
#> 5 BFM2 Boucle 1 c_20_24 20 181 <NA> 712
Upvotes: 2
Reputation: 950
I think the best way would be:
library(data.table)
library(stringr)
data<-list("vector")
files_to_loop<-list.vector()[str_detect(list.vector(),".csv")]
for (i in 1:length(files_to_loop)){
data[[i]]<-fread(files_to_loop[i])
}
data<-rbindlist(data,use.names=TRUE,fill=TRUE)
Upvotes: 0