Jklein

Reputation: 101

Fastest way to import SPSS data into R as dataframe

I need to import an SPSS .sav file into R every day as a data frame without value labels. The file is 120,000+ obs and growing. This process is getting incredibly slow, so I want to make sure I'm using the fastest possible method. I've been playing around with the functions in foreign, haven, and memisc. I'm working with RDS if that makes a difference.

Edit: My file is 126343 x 33067 and 12.1 GB. I'm simply running the following code:

library(haven)
data <- read_sav(file)

I can't share this file, but to try to replicate the problem, I ran:

library(haven)
n <- 126343
exd <- data.frame(c(replicate(2000, sample(letters, n, replace = TRUE),
                              simplify = FALSE),
                    replicate(1306, runif(n),
                              simplify = FALSE)))
dim(exd)
## [1] 126343    3306
tmp <- tempfile(fileext = ".sav")
write_sav(exd, tmp)
system.time(exd2 <- read_sav(tmp))
##   user  system elapsed 
##  173.34   13.94   187.66 

Thanks!

Upvotes: 0

Views: 3753

Answers (2)

Ista

Reputation: 10437

120000 rows isn't very big. Unless you have a very low-resource system, I wouldn't expect this to be much of a bottleneck at all. On my mid-range laptop it takes just a few seconds to read a 122000 x 150 .sav file:

library(haven)
n <- 122000
exd <- data.frame(c(replicate(50, sample(letters, n, replace = TRUE),
                              simplify = FALSE),
                    replicate(100, runif(n),
                              simplify = FALSE)))
dim(exd)
## [1] 122000    150
tmp <- tempfile(fileext = ".sav")
write_sav(exd, tmp)
system.time(exd2 <- read_sav(tmp))
##   user  system elapsed 
##  1.913   0.096   2.015 

Since I can't reproduce the problem as you've described it, you should provide more details to make it clearer what the issue is. If you show the code and (a subset or simulation of) the data you're working with, you might get some help identifying the likely bottleneck.
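
One way to start narrowing down a bottleneck like this is to time a full read against a read restricted to a few columns and rows, using the `col_select` and `n_max` arguments of `read_sav()`. The following is only a rough sketch on a small simulated file; the sizes and the column names `v1`, `v2`, ... are arbitrary assumptions for illustration, not taken from the real data.

```r
library(haven)

# Small simulated data set; sizes are arbitrary and chosen to run quickly.
n <- 1000
cols <- c(replicate(20, sample(letters, n, replace = TRUE),
                    simplify = FALSE),
          replicate(30, runif(n),
                    simplify = FALSE))
names(cols) <- paste0("v", seq_along(cols))  # hypothetical column names
exd <- as.data.frame(cols)

tmp <- tempfile(fileext = ".sav")
write_sav(exd, tmp)

# Time a full read against a read limited to two columns and 100 rows.
t_full <- system.time(full <- read_sav(tmp))
t_part <- system.time(part <- read_sav(tmp, col_select = c(v1, v21),
                                       n_max = 100))

dim(full)
dim(part)
t_full
t_part
```

If the restricted read is much faster on the real file, the cost is probably dominated by the number of columns being parsed, and reading only the needed variables may be enough.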

Upvotes: 1

Carlos Santillan

Reputation: 1087

The haven package (part of the tidyverse) would be my choice, but I have not used it on datasets this big.

https://github.com/tidyverse/haven
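
Since the question asks for a data frame without value labels, here is a minimal sketch of how that might look with haven, using `zap_labels()` after the read. The toy data and its column names are hypothetical stand-ins for the real file, and this has not been tried on anything close to 12 GB.

```r
library(haven)

# Hypothetical toy data with one labelled column, standing in for the real file.
toy <- data.frame(id = 1:3)
toy$sex <- labelled(c(1, 2, 1), labels = c(Male = 1, Female = 2))

tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)

raw <- read_sav(tmp)                      # 'sex' is read as a labelled vector
plain <- as.data.frame(zap_labels(raw))   # drop value labels, keep the codes

is.labelled(raw$sex)
is.labelled(plain$sex)
```

`zap_labels()` only strips the labels and leaves the underlying codes as they are, so it does not by itself speed up the read.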

Upvotes: 0
