Reputation: 101
I need to import an SPSS .sav file into R every day as a data frame without value labels. The file has 120,000+ observations and is growing. This process is getting incredibly slow, so I want to make sure I'm using the fastest possible method. I've been playing around with the functions in foreign, haven, and memisc. I'm working with RDS, if that makes a difference.
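For reference, the label-free imports I've been comparing look roughly like this (foreign and haven shown; memisc's importer interface omitted), with file standing in for the path to the .sav file:
library(foreign)
library(haven)
# foreign: skip value labels at read time and return a data frame directly
df_foreign <- read.spss(file, use.value.labels = FALSE, to.data.frame = TRUE)
# haven: value labels come through as attributes on labelled columns
df_haven <- read_sav(file)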
Edit: My file is 126,343 x 33,067 and 12.1 GB. I'm simply running the following code:
library(haven)
data <- read_sav(file)
I can't share this file, but to attempt to replicate, I did:
library(haven)
n <- 126343
exd <- data.frame(c(replicate(2000, sample(letters, n, replace = TRUE),
                              simplify = FALSE),
                    replicate(1306, runif(n),
                              simplify = FALSE)))
dim(exd)
## [1] 126343 3306
tmp <- tempfile(fileext = ".sav")
write_sav(exd, tmp)
system.time(exd2 <- read_sav(tmp))
##    user  system elapsed
##  173.34   13.94  187.66
Thanks!
Upvotes: 0
Views: 3753
Reputation: 10437
120,000 rows isn't very big. Unless you have a very low-resource system, I wouldn't expect this to be much of a bottleneck at all. On my mid-range laptop it takes just a few seconds to read a 122,000 x 150 .sav file:
library(haven)
n <- 122000
exd <- data.frame(c(replicate(50, sample(letters, n, replace = TRUE),
                              simplify = FALSE),
                    replicate(100, runif(n),
                              simplify = FALSE)))
dim(exd)
## [1] 122000 150
tmp <- tempfile(fileext = ".sav")
write_sav(exd, tmp)
system.time(exd2 <- read_sav(tmp))
##   user  system elapsed
##  1.913   0.096   2.015
Since I can't reproduce the problem as you've described it, you should provide more details to make it clearer what the issue is. If you show the code and (a subset or simulation of) the data you're working with, you might get some help identifying the likely bottleneck.
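If you can't share even a simulated version, one way to narrow it down yourself is to time small slices of the real file. Assuming a reasonably recent haven (>= 2.2.0, where read_sav() gained the n_max and col_select arguments), something along these lines would show whether the number of rows or the number of columns dominates the read time:
library(haven)
# Time a row-limited read vs. a column-limited read of the real file
system.time(head_rows <- read_sav(file, n_max = 1000))       # all columns, first 1,000 rows
system.time(head_cols <- read_sav(file, col_select = 1:100)) # first 100 columns, all rows
Whichever of those is already slow points at where the time is going.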
Upvotes: 1
Reputation: 1087
The haven package (part of the tidyverse) would be my choice, but I haven't used it on datasets that big.
https://github.com/tidyverse/haven
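A minimal sketch of that approach, assuming the goal is a plain data frame with no value labels (read_sav() and zap_labels() are both haven functions; the path is a placeholder):
library(haven)
# Read the .sav file; labelled columns keep their value labels as attributes
raw <- read_sav("path/to/file.sav")
# Strip the value labels, leaving ordinary numeric/character columns
dat <- as.data.frame(zap_labels(raw))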
Upvotes: 0