Reputation: 39343

How to use a .SAS or SPS metadata file to read a CSV as a Pandas dataframe?

I have a big CSV file and it comes with two metadata description files. One has a .sas extension and the other a .sps. Opening them, they describe the CSV data format and categories. The files describe the data format and possible categories of each column. For example, a column with values 1 or 2 is mapped to yes and no.

How can I use these metadata files to help me read the CSV file?

I can easily read it using read_csv, but these files are useful to automatically create my columns with the possible categories. I can create a parser for them, but there must be a package or function to do it. Maybe I'm not using the correct search terms.

Here is the .sas file (sorry, it is Portuguese):

proc format;
Value $SG_AREA
        CH='Ciךncias Humanas'
        CN='Ciךncias da Natureza'
        LC='Linguagens e Cףdigos'
        MT='Matemבtica';

Value $TP_LINGUA
        0='Inglךs'
        1='Espanhol';

Value $IN_ITEM_ADAPTADO
        0='Nדo'
        1='Sim';


DATA WORK.ITENS_2018;
INFILE 'C:\ITENS_PROVA_2018.csv' /*local do arquivo*/
        LRECL=33
        FIRSTOBS=2
        DLM=';'
        MISSOVER
        DSD ;

INPUT
        CO_POSICAO       : BEST2.
        SG_AREA          : $CHAR2.
        CO_ITEM          : BEST6.
        TX_GABARITO      : $CHAR1.
        CO_HABILIDADE    : BEST2.
        TX_COR           : $CHAR7.
        CO_PROVA         : BEST3.
        TP_LINGUA        : $CHAR1.
        IN_ITEM_ADAPTADO : $CHAR1. ;

ATTRIB  SG_AREA          FORMAT = $SG_AREA20.;         
ATTRIB  TP_LINGUA        FORMAT = $TP_LINGUA8.;       
ATTRIB  IN_ITEM_ADAPTADO FORMAT = $IN_ITEM_ADAPTADO3.;

LABEL
CO_POSICAO='Posiחדo do Item na Prova'
SG_AREA='ֱrea de Conhecimento do Item'
CO_ITEM='Cףdigo do Item'
TX_GABARITO='Gabarito do Item'
CO_HABILIDADE='Habilidade do Item'
TX_COR='Cor da Prova'
CO_PROVA='Identificador da Prova'
TP_LINGUA='Lםngua Estrangeira '
IN_ITEM_ADAPTADO='Item pertencente א prova adaptada para Ledor'

;RUN;

And here you can see the equivalent .sps file:

GET DATA
  /TYPE=TXT
  /FILE= "C:\ITENS_PROVA_2018.csv" /*local do arquivo*/ 
  /DELCASE=LINE
  /DELIMITERS=";"
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /IMPORTCASE= ALL
  /VARIABLES=
CO_POSICAO F2.0
SG_AREA A2
CO_ITEM F6.0
TX_GABARITO A1
CO_HABILIDADE F2.0
TX_COR A7
CO_PROVA F3.0
TP_LINGUA A1       
IN_ITEM_ADAPTADO A1.
CACHE.
EXECUTE.
DATASET NAME ITENS_18 WINDOW=FRONT.

VARIABLE LABELS
CO_POSICAO  Posiחדo do Item na Prova
SG_AREA     ֱrea de Conhecimento do Item
CO_ITEM     Cףdigo do Item
TX_GABARITO Gabarito do Item
CO_HABILIDADE   Habilidade do Item
TX_COR      Cor da Prova
CO_PROVA    Identificador da Prova
TP_LINGUA       Lםngua Estrangeira
IN_ITEM_ADAPTADO    Item pertencente א prova adaptada para Ledor.


VALUE LABELS
SG_AREA
        "CH"    Ciךncias Humanas
        "CN"    Ciךncias da Natureza
        "LC"    Linguagens e Cףdigos
        "MT"    Matemבtica
/TP_LINGUA
        0   Inglךs
        1   Espanhol
/IN_ITEM_ADAPTADO
        0   Nדo
        1   Sim.

You can see that they describe the metadata for each column.

Upvotes: 7

Answers (2)

Stephen R

Reputation: 56

.sas is the program file extension for SAS, and is designed to be used via SAS. It is essentially a command file serving as a dictionary file.

.sps is the program file extension for SPSS, and is designed to be used via SPSS. It is essentially a command file serving as a dictionary file. I'd give a handy link here too but SPSS is an IBM product and their documentation is a hellish landscape none should tread.

What you're trying to do should be possible despite that. Pandas by itself is insufficient, as it has no functions built in to address these situations. Pandas support for SAS only extends to .sas7bdat data files, and for SPSS only extends to .sav data files.

Python (and Pandas) can read the .sas and .sps extensions because they're plain text files, but can't actually do anything with them.

Here are two paths for you to take to get what you're after.

1) Install SAS or SPSS on a trial, use it to read the data and then export in an alternative format.

2) Install and attempt to use the pyreadstat extension for Pandas.

It sounds like the pandas framework is your preference and for that to work you'll need to expand what it can do. In this case, with the pyreadstat extension. It is designed to work with SAS and SPSS data files and processes them far more efficiency than pandas by itself. This solution comes with a caveat.

Pyreadstat is itself a conversion of ReadStat. Quoting the pyreadstat readme file:

This module is a wrapper around the excellent Readstat C library by Evan Miller. 
Readstat is the library used in the back of the R library Haven, 
meaning pyreadstat is a python equivalent to R Haven.

If you look only at the pyreadstat files you won't find anything touching on .sas or .sps or dictionary files in general. Instead, you'll want to look at the readme for ReadStat found here. It has a section specifically covering such circumstances.

As of yet I have not tested the ReadStat commands and functions that exist for dictionary files in pyreadstat, so there is a possibility this will not work.

If you attempt this solution and it fails for you, follow up to the thread and I'll help you troubleshoot.

Upvotes: 2

Rene B.

Reputation: 7394

The clean way would be to export your SAS data either as XPORT or SAS7BDAT format files.

Afterwards you can use the pandas function pandas.read_sas:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html

The import numpy as np import pandas as pd

df = pd.read_sas('test.sas7bdat')

If you have large files you can use then the chunksize parameter to read only x file lines at a time, returns iterator. Or you can use iterator parameter in order to return an iterator for reading the file incrementally.

Upvotes: 0

How to use a .SAS or SPS metadata file to read a CSV as a Pandas dataframe?

Answers (2)

Related Questions