Reputation: 39343
I have a big CSV file and it comes with two metadata description files. One has a .sas
extension and the other a .sps
. Opening them, they describe the CSV data format and categories. The files describe the data format and possible categories of each column. For example, a column with values 1 or 2 is mapped to yes and no.
How can I use these metadata files to help me read the CSV file?
I can easily read it using read_csv, but these files are useful to automatically create my columns with the possible categories. I can create a parser for them, but there must be a package or function to do it. Maybe I'm not using the correct search terms.
Here is the .sas
file (sorry, it is Portuguese):
proc format;
Value $SG_AREA
CH='Ciךncias Humanas'
CN='Ciךncias da Natureza'
LC='Linguagens e Cףdigos'
MT='Matemבtica';
Value $TP_LINGUA
0='Inglךs'
1='Espanhol';
Value $IN_ITEM_ADAPTADO
0='Nדo'
1='Sim';
DATA WORK.ITENS_2018;
INFILE 'C:\ITENS_PROVA_2018.csv' /*local do arquivo*/
LRECL=33
FIRSTOBS=2
DLM=';'
MISSOVER
DSD ;
INPUT
CO_POSICAO : BEST2.
SG_AREA : $CHAR2.
CO_ITEM : BEST6.
TX_GABARITO : $CHAR1.
CO_HABILIDADE : BEST2.
TX_COR : $CHAR7.
CO_PROVA : BEST3.
TP_LINGUA : $CHAR1.
IN_ITEM_ADAPTADO : $CHAR1. ;
ATTRIB SG_AREA FORMAT = $SG_AREA20.;
ATTRIB TP_LINGUA FORMAT = $TP_LINGUA8.;
ATTRIB IN_ITEM_ADAPTADO FORMAT = $IN_ITEM_ADAPTADO3.;
LABEL
CO_POSICAO='Posiחדo do Item na Prova'
SG_AREA='ֱrea de Conhecimento do Item'
CO_ITEM='Cףdigo do Item'
TX_GABARITO='Gabarito do Item'
CO_HABILIDADE='Habilidade do Item'
TX_COR='Cor da Prova'
CO_PROVA='Identificador da Prova'
TP_LINGUA='Lםngua Estrangeira '
IN_ITEM_ADAPTADO='Item pertencente א prova adaptada para Ledor'
;RUN;
And here you can see the equivalent .sps
file:
GET DATA
/TYPE=TXT
/FILE= "C:\ITENS_PROVA_2018.csv" /*local do arquivo*/
/DELCASE=LINE
/DELIMITERS=";"
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE= ALL
/VARIABLES=
CO_POSICAO F2.0
SG_AREA A2
CO_ITEM F6.0
TX_GABARITO A1
CO_HABILIDADE F2.0
TX_COR A7
CO_PROVA F3.0
TP_LINGUA A1
IN_ITEM_ADAPTADO A1.
CACHE.
EXECUTE.
DATASET NAME ITENS_18 WINDOW=FRONT.
VARIABLE LABELS
CO_POSICAO Posiחדo do Item na Prova
SG_AREA ֱrea de Conhecimento do Item
CO_ITEM Cףdigo do Item
TX_GABARITO Gabarito do Item
CO_HABILIDADE Habilidade do Item
TX_COR Cor da Prova
CO_PROVA Identificador da Prova
TP_LINGUA Lםngua Estrangeira
IN_ITEM_ADAPTADO Item pertencente א prova adaptada para Ledor.
VALUE LABELS
SG_AREA
"CH" Ciךncias Humanas
"CN" Ciךncias da Natureza
"LC" Linguagens e Cףdigos
"MT" Matemבtica
/TP_LINGUA
0 Inglךs
1 Espanhol
/IN_ITEM_ADAPTADO
0 Nדo
1 Sim.
You can see that they describe the metadata for each column.
Upvotes: 7
Views: 1809
Reputation: 56
.sas is the program file extension for SAS, and is designed to be used via SAS. It is essentially a command file serving as a dictionary file.
.sps is the program file extension for SPSS, and is designed to be used via SPSS. It is essentially a command file serving as a dictionary file. I'd give a handy link here too but SPSS is an IBM product and their documentation is a hellish landscape none should tread.
What you're trying to do should be possible despite that. Pandas by itself is insufficient, as it has no functions built in to address these situations. Pandas support for SAS only extends to .sas7bdat data files, and for SPSS only extends to .sav data files.
Python (and Pandas) can read the .sas and .sps extensions because they're plain text files, but can't actually do anything with them.
Here are two paths for you to take to get what you're after.
1) Install SAS or SPSS on a trial, use it to read the data and then export in an alternative format.
2) Install and attempt to use the pyreadstat extension for Pandas.
It sounds like the pandas framework is your preference and for that to work you'll need to expand what it can do. In this case, with the pyreadstat extension. It is designed to work with SAS and SPSS data files and processes them far more efficiency than pandas by itself. This solution comes with a caveat.
Pyreadstat is itself a conversion of ReadStat. Quoting the pyreadstat readme file:
This module is a wrapper around the excellent Readstat C library by Evan Miller.
Readstat is the library used in the back of the R library Haven,
meaning pyreadstat is a python equivalent to R Haven.
If you look only at the pyreadstat files you won't find anything touching on .sas or .sps or dictionary files in general. Instead, you'll want to look at the readme for ReadStat found here. It has a section specifically covering such circumstances.
As of yet I have not tested the ReadStat commands and functions that exist for dictionary files in pyreadstat, so there is a possibility this will not work.
If you attempt this solution and it fails for you, follow up to the thread and I'll help you troubleshoot.
Upvotes: 2
Reputation: 7394
The clean way would be to export your SAS data either as XPORT or SAS7BDAT format files.
Afterwards you can use the pandas function pandas.read_sas
:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html
The import numpy as np import pandas as pd
df = pd.read_sas('test.sas7bdat')
If you have large files you can use then the chunksize
parameter to read only x file lines at a time, returns iterator. Or you can use iterator
parameter
in order to return an iterator for reading the file incrementally.
Upvotes: 0