Reputation: 39
I have this small little code here to import a SAS file into dataframe in Python.
from sas7bdat import SAS7BDAT
with SAS7BDAT('some_file.sas7bdat') as f:
df = f.to_data_frame()
print df.head(5)
The code runs forever without any output. The sas file I'm trying to import is 1.5gb.
Upvotes: 1
Views: 11020
Reputation: 18406
I know it's a very late response but I think my answer is going to be useful for future readers. Few months back when I had to read and process SAS
data either SAS7BDAT
or xpt
format SAS
data, I was looking for different libraries and packages available to read these datasets, among them, I shortlisted the libraries as follows:
pandas
(It was on high priority list due to community support and
performance)SAS7BDAT
(Is able to read SAS7BDAT
files only, and last release July 2019)pyreadstat
(Promising performance as per the documentation plus ability to read meta data)Before picking up any package, I did some performance benchmarking, although I don't have benchmark result at the time of posting this answer, I found pyreadstat
to be faster than pandas
, (seems like it's using multiprocessing while reading the data as mentioned in the documentation but I'm not exactly sure), and also the memory consumption and the footprint was much lesser while using pyreadstat
in comparison to pandas
, plus it is able to read the metadata, and even allows to read the metadeta only, so I finally ended up picking pyreadstat
.
The data read using pyreadstat
is already in the form of dataframe, so it doesn't need some manual conversion to pandas dataframe.
import pyreadstat
filename = 'some_file.sas7bdat'
df, _ = pyreadstat.read_sas7bdat(filename)
Here is some benchmark (Time to read the file to a dataframe) performed on real data (Raw and Standardized) for CDISC, the file size ranges from some KB to some MB, and includes both xpt and sas7bdat file formats:
Reading ADAE.xpt 49.06 KB for 100 loops:
Pandas Average time : 0.02232 seconds
Pyreadstat Average time : 0.04819 seconds
----------------------------------------------------------------------------
Reading ADIE.xpt 27.73 KB for 100 loops:
Pandas Average time : 0.01610 seconds
Pyreadstat Average time : 0.03981 seconds
----------------------------------------------------------------------------
Reading ADVS.xpt 386.95 KB for 100 loops:
Pandas Average time : 0.03248 seconds
Pyreadstat Average time : 0.07580 seconds
----------------------------------------------------------------------------
Reading beck.sas7bdat 14.72 MB for 50 loops:
Pandas Average time : 5.30275 seconds
Pyreadstat Average time : 0.60373 seconds
----------------------------------------------------------------------------
Reading p0_qs.sas7bdat 42.61 MB for 50 loops:
Pandas Average time : 15.53942 seconds
Pyreadstat Average time : 1.69885 seconds
----------------------------------------------------------------------------
Reading ta.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.04017 seconds
Pyreadstat Average time : 0.00152 seconds
----------------------------------------------------------------------------
Reading te.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.01052 seconds
Pyreadstat Average time : 0.00109 seconds
----------------------------------------------------------------------------
Reading ti.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.04446 seconds
Pyreadstat Average time : 0.00179 seconds
----------------------------------------------------------------------------
Reading ts.sas7bdat 33.00 KB for 100 loops:
Pandas Average time : 0.01273 seconds
Pyreadstat Average time : 0.00129 seconds
----------------------------------------------------------------------------
Reading t_frcow.sas7bdat 14.59 MB for 50 loops:
Pandas Average time : 7.93266 seconds
Pyreadstat Average time : 0.92295 seconds
As you can see, for xpt files, the time to read the files isn't better, but for sas7bdat files, pyreadstat just outperforms pandas.
The above benchmark was performed on pyreadstat 1.0.9, pandas 1.2.4 and Python 3.7.5.
Upvotes: 1
Reputation: 14689
You should use the native pandas function pandas.read_sas
it's faster than iterating through the file as you did.
Here is the documentation of the pandas.read_sas
function. This code sample should be sufficient to load the file:
df = pandas.read_sas('some_file.sas7bdat')
print(df.head())
Upvotes: 4