Keerthimanu Gattu
Keerthimanu Gattu

Reputation: 39

Convert SAS data to a python dataframe

I have this small little code here to import a SAS file into dataframe in Python.

from sas7bdat import SAS7BDAT
with SAS7BDAT('some_file.sas7bdat') as f:
    df = f.to_data_frame()
print df.head(5)

The code runs forever without any output. The sas file I'm trying to import is 1.5gb.

Upvotes: 1

Views: 11020

Answers (2)

ThePyGuy
ThePyGuy

Reputation: 18406

I know it's a very late response but I think my answer is going to be useful for future readers. Few months back when I had to read and process SAS data either SAS7BDAT or xpt format SAS data, I was looking for different libraries and packages available to read these datasets, among them, I shortlisted the libraries as follows:

  1. pandas (It was on high priority list due to community support and performance)
  2. SAS7BDAT (Is able to read SAS7BDAT files only, and last release July 2019)
  3. pyreadstat (Promising performance as per the documentation plus ability to read meta data)

Before picking up any package, I did some performance benchmarking, although I don't have benchmark result at the time of posting this answer, I found pyreadstat to be faster than pandas, (seems like it's using multiprocessing while reading the data as mentioned in the documentation but I'm not exactly sure), and also the memory consumption and the footprint was much lesser while using pyreadstat in comparison to pandas, plus it is able to read the metadata, and even allows to read the metadeta only, so I finally ended up picking pyreadstat.

The data read using pyreadstat is already in the form of dataframe, so it doesn't need some manual conversion to pandas dataframe.

import pyreadstat
filename = 'some_file.sas7bdat'

df, _ = pyreadstat.read_sas7bdat(filename)

Here is some benchmark (Time to read the file to a dataframe) performed on real data (Raw and Standardized) for CDISC, the file size ranges from some KB to some MB, and includes both xpt and sas7bdat file formats:

Reading ADAE.xpt 49.06 KB for 100 loops:
    Pandas Average time : 0.02232 seconds
    Pyreadstat Average time : 0.04819 seconds
----------------------------------------------------------------------------
Reading ADIE.xpt 27.73 KB for 100 loops:
    Pandas Average time : 0.01610 seconds
    Pyreadstat Average time : 0.03981 seconds
----------------------------------------------------------------------------
Reading ADVS.xpt 386.95 KB for 100 loops:
    Pandas Average time : 0.03248 seconds
    Pyreadstat Average time : 0.07580 seconds
----------------------------------------------------------------------------
Reading beck.sas7bdat 14.72 MB for 50 loops:
    Pandas Average time : 5.30275 seconds
    Pyreadstat Average time : 0.60373 seconds
----------------------------------------------------------------------------
Reading p0_qs.sas7bdat 42.61 MB for 50 loops:
    Pandas Average time : 15.53942 seconds
    Pyreadstat Average time : 1.69885 seconds
----------------------------------------------------------------------------
Reading ta.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04017 seconds
    Pyreadstat Average time : 0.00152 seconds
----------------------------------------------------------------------------
Reading te.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01052 seconds
    Pyreadstat Average time : 0.00109 seconds
----------------------------------------------------------------------------
Reading ti.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.04446 seconds
    Pyreadstat Average time : 0.00179 seconds
----------------------------------------------------------------------------
Reading ts.sas7bdat 33.00 KB for 100 loops:
    Pandas Average time : 0.01273 seconds
    Pyreadstat Average time : 0.00129 seconds
----------------------------------------------------------------------------
Reading t_frcow.sas7bdat 14.59 MB for 50 loops:
    Pandas Average time : 7.93266 seconds
    Pyreadstat Average time : 0.92295 seconds

As you can see, for xpt files, the time to read the files isn't better, but for sas7bdat files, pyreadstat just outperforms pandas.

The above benchmark was performed on pyreadstat 1.0.9, pandas 1.2.4 and Python 3.7.5.

Upvotes: 1

Mohamed Ali JAMAOUI
Mohamed Ali JAMAOUI

Reputation: 14689

You should use the native pandas function pandas.read_sas it's faster than iterating through the file as you did.

Here is the documentation of the pandas.read_sas function. This code sample should be sufficient to load the file:

df = pandas.read_sas('some_file.sas7bdat')
print(df.head())

Upvotes: 4

Related Questions