Reputation: 331
I have a 10 GB .dta Stata file that I am trying to read into 64-bit R 3.3.1. I am working on a virtual machine with about 130 GB of RAM (4 TB HD), and the .dta file has about 3 million rows and somewhere between 400 and 800 variables.
I know data.table's fread() is the fastest way to read in .txt and .csv files, but does anyone have a recommendation for reading largish .dta files into R? Opening the .dta file in Stata takes about 20-30 seconds, although I need to raise my working memory limit before opening the file (I set the max at 100 GB).
I have not tried exporting to .csv from Stata, as I hope to avoid touching the file with Stata. One solution is described in "Using memisc to import stata .dta file into R", but it assumes RAM is scarce. In my case, I should have sufficient RAM to work with the file.
Upvotes: 8
Views: 8043
Reputation: 473
Since this post is at the top of the search results, I re-ran the benchmark on the current versions of haven and readstata13. The two packages now seem comparable, with haven slightly faster. In terms of time complexity, both are approximately linear in the number of rows.
Here is the code to run the benchmark:
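library(haven)
library(readstata13)
library(dplyr)  # tibble(), bind_rows(), %>%

# Row counts to test, from 1e2 to 1e7, rounded to whole numbers
sizes <- round(10^(seq(2, 7, .5)))

benchmark_read <- function(n_rows){
  # Time haven reading the first n_rows rows
  start_t_haven <- Sys.time()
  maisanta_dataset <- read_dta("my_large_file.dta", n_max = n_rows)
  end_t_haven <- Sys.time()

  # Time readstata13 reading the same number of rows
  start_t_readstata13 <- Sys.time()
  maisanta_dataset <- read.dta13("my_large_file.dta", select.rows = n_rows)
  end_t_readstata13 <- Sys.time()

  # Store elapsed times in seconds so all rows use the same unit
  tibble(size = n_rows,
         haven_time = as.numeric(difftime(end_t_haven, start_t_haven, units = "secs")),
         readstata13_time = as.numeric(difftime(end_t_readstata13, start_t_readstata13, units = "secs")))
}

benchmark_results <-
  lapply(sizes, benchmark_read) %>%
  bind_rows()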
Upvotes: 5
Reputation: 192
The fastest way to load a large Stata dataset in R is to use the readstata13 package. I compared the performance of the foreign, readstata13, and haven packages on a large dataset in this post, and the results repeatedly showed that readstata13 is the fastest available package for reading Stata datasets in R.
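For anyone who wants to try readstata13 before pointing it at a 10 GB file, here is a minimal sketch. The toy data frame and the temporary path are made up for illustration and stand in for the real file; select.rows and select.cols are optional arguments of read.dta13() that limit what gets loaded.

```r
library(readstata13)

# Hypothetical stand-in for the real file: write a small .dta to a temp path
path <- tempfile(fileext = ".dta")
toy <- data.frame(id = 1:100, x = rnorm(100), y = runif(100))
save.dta13(toy, file = path)

# Read the whole file into a data.frame
full <- read.dta13(path)

# Read only the first 10 rows and two of the variables
part <- read.dta13(path, select.rows = 10, select.cols = c("id", "x"))

dim(full)
dim(part)
```

Reading a slice first is a cheap way to check variable names and types before committing to the full load.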
Upvotes: 5
Reputation: 5942
I recommend the haven R package. Unlike foreign, it can read the latest Stata formats:
library(haven)
data <- read_dta('myfile.dta')
I'm not sure how fast it is compared to other options, but your choices for reading Stata files in R are rather limited. My understanding is that haven wraps a C library, so it's probably your fastest option.
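If you want to measure the speed yourself rather than guess, a rough sketch along these lines should work. The toy data and temporary file are invented for the example, and the col_select argument assumes a reasonably recent haven release (n_max is also an argument of read_dta()).

```r
library(haven)

# Hypothetical stand-in for the real file: write a small .dta to a temp path
path <- tempfile(fileext = ".dta")
toy <- data.frame(id = 1:100, x = rnorm(100), y = runif(100))
write_dta(toy, path)

# Time a full read; system.time() reports user, system and elapsed seconds
timing <- system.time(full <- read_dta(path))
print(timing)

# Read only the first 10 rows of two variables
part <- read_dta(path, col_select = c(id, x), n_max = 10)

dim(full)
dim(part)
```

On a file as large as the one in the question, timing a partial read with n_max first gives a quick estimate of how long the full read will take.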
Upvotes: 2