Reputation: 307
I have a folder with more than 500 .dta files. I would like to load some of this files into a single R object.
My .dta files have a generic name composed of four parts : 'two letters/four digits/y/.dta'. For instance, a name can be 'de2015y.dta' or 'fr2008y.dta'. Only the parts corresponding to the two letters and the four digits change across the .dta file.
I have written a code that works, but I am not satisfied with it. I would like to avoid using a loop and to shorten it.
My code is:
# Select the .dta files I want to load
#.....................................
name <- list.files(path="E:/Folder") # names of the .dta files in the folder
db <- as.data.frame(name)
db$year <- substr(db$name, 3, 6)
db <- subset (db, year == max(db$year)) # keep last year available
db$country <- substr(db$name, 1, 2)
list.name <- as.list(db$country)
# Loading all the .dta files in the Global environment
#..................................................
for(i in c(list.name)){
obj_name <- paste(i, '2015y', sep='')
file_name <- file.path('E:/Folder',paste(obj_name,'dta', sep ='.'))
input <- read.dta13(file_name)
assign(obj_name, value = input)
}
# Merge the files into a single object
#..................................................
df2015 <- rbind (at2015y, be2015y, bg2015y, ch2015y, cy2015y, cz2015y, dk2015y, ee2015y, ee2015y, es2015y, fi2015y,
fr2015y, gr2015y, hr2015y, hu2015y, ie2015y, is2015y, it2015y, lt2015y, lu2015y, lv2015y, mt2015y,
nl2015y, no2015y, pl2015y, pl2015y, pt2015y, ro2015y, se2015y, si2015y, sk2015y, uk2015y)
Does anyone know how I can avoid using a loop and shortening my code ?
Upvotes: 1
Views: 1369
Reputation: 149
I would change the working directory for this task... Then does this do what you are asking for?
setwd("C:/.../yourfiles")
# get file names where year equals "2015"
name=list.files(pattern="*.dta")
name=name[substr(name,3,6)=="2015"]
# read in the files in a list
files=lapply(name,foreign::read.dta)
# remove ".dta" from file names and
# give the file contents in the list their name
names(files)=lapply(name,function(x) substr(x, 1, nchar(x)-4))
#or alternatively
names(files)=as.list(substr(name,1,nchar(name)-4))
# optional: put all file contents into one data-frame
#(data-frames (vectors) need to have the same row counts (lengths) for this last step to work)
mydatafrm = data.frame(files)
Upvotes: 0
Reputation: 980
You can also use purrr
for your task.
First create a named vector of all files you want to load (as I understand your question, you simply need all files from 2015). The setNames()
part is only necessary in case you'd like an ID variable in your data frame and it is not already included in the .dta files.
After that, simply use map_df()
to read all files and return a data frame. Specifying .id
is optional and results in an ID column the values of which are based on the names of in_files
.
library(purrr)
library(haven)
in_files <- list.files(path="E:/Folder", pattern = "2015y", full.names = TRUE)
in_files <- setNames(in_files, in_files)
df2015 <- map_df(in_files, read_dta, .id = "id")
Upvotes: 2
Reputation: 4187
The following steps should give you what you want:
Load the foreign
package:
library(foreign) # or alternatively: library(haven)
create a list of file names
file.list <- list.files(path="E:/Folder", pattern='*.dat', full.names = TRUE)
determine which files to read (NOTE: you have to check if these are the correct position in substr
it is an estimate from my side)
vec <- as.integer(substr(file.list,13,16))
file.list2 <- file.list[vec==max(vec)]
read the files
df.list <- sapply(file.list2, read.dta, simplify=FALSE)
remove the path from the listnames
names(df.list) <- gsub("E:/Folder","",names(df.list))
bind the the dataframes together in one data.frame/data.table and create an id-column as well
library(data.table)
df <- rbindlist(df.list, idcol = "id")
# or with 'dplyr'
library(dplyr)
df <- bind_rows(df.list, .id = "id")
Now you have a data.frame with an id-column that identifies the different original files.
Upvotes: 0