Uma Sandeep

Reputation: 3

How to read the data type from a sas7bdat file through pyreadstat? meta.original_variable_types gives different values

import pyreadstat
df, meta = pyreadstat.read_sas7bdat('c:/ae.sas7bdat')
print(meta.original_variable_types)

This code prints the following values:

{
    "TRIAL_NAME":"$",
    "SITEMNEMONIC":"$",
    "PATIENTNUMBER":"$",
    "VISITID":"BEST",
    "VISITREFNAME":"$",
    "SEQ":"BEST",
    "PANELNAME":"$",
    "STATUS":"DND",
    "COMPDT":"$",
    "COMPTM":"$",
    "SPECID":"$"
}

From the SAS documentation I understood that $ represents character and BEST represents numeric. But what are the other types, then? When I open my file in SAS viewer, I can see the type shown as Character or Numeric. How can I retrieve that? I am attaching an image of the meta information from SAS viewer; that is the type I want to retrieve.

[Image: meta information]

Upvotes: 0

Views: 1459

Answers (1)

Otto Fajardo

Reputation: 3407

If you only need the type, it is easy: in pyreadstat, if the value is $ the variable is character; if not, it is numeric.
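
As a minimal sketch of that rule, the snippet below classifies each entry of a dictionary shaped like meta.original_variable_types. The helper names are hypothetical, the sample dictionary is copied from the question, and returning 'unknown' for an empty or missing format is my own assumption, not pyreadstat behaviour.

```python
def sas_type_from_format(fmt):
    """Classify one SAS format string as 'character' or 'numeric'."""
    # Assumption: an empty or missing format gives no information about the type.
    if not fmt:
        return "unknown"
    # Character formats start with '$'; every other format belongs to a number.
    return "character" if fmt.startswith("$") else "numeric"


def sas_types_from_formats(formats):
    """Apply the rule to a dict such as meta.original_variable_types."""
    return {name: sas_type_from_format(fmt) for name, fmt in formats.items()}


# A few entries from the output shown in the question.
formats = {"TRIAL_NAME": "$", "VISITID": "BEST", "STATUS": "DND"}
sas_types = sas_types_from_formats(formats)
print(sas_types)
```

With a real file you would pass meta.original_variable_types instead of the hand-written dictionary.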

What you are seeing in pyreadstat is what you have in the format column of SAS, without the variable width (which pyreadstat stores separately in meta.variable_display_width). You will observe in your screenshot that all character variables have a format that starts with $; the number that comes next is the variable width.

SAS has only two types: character and numeric. Therefore, if a variable is not character, it is numeric. The format tells SAS how to display the variable. For characters, the only option is to display the characters ($) with a certain width, as there are no other alternatives. Numbers can be displayed in different ways: as BEST, but also as DATE if they represent the number of days since Jan 1st 1960, as DATETIME if they represent the number of seconds since Jan 1st 1960, etc.
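
To illustrate why a date is still a number in SAS, here is a small sketch that converts raw SAS date and datetime values by counting from Jan 1st 1960. The function names are hypothetical. As far as I know, pyreadstat normally does this conversion itself for the date formats it recognizes, so this only shows the idea.

```python
from datetime import date, datetime, timedelta

# SAS counts dates and datetimes from this point in time.
SAS_EPOCH = datetime(1960, 1, 1)


def sas_date_to_date(days):
    """Interpret a numeric value as days since Jan 1st 1960."""
    return (SAS_EPOCH + timedelta(days=days)).date()


def sas_datetime_to_datetime(seconds):
    """Interpret a numeric value as seconds since Jan 1st 1960."""
    return SAS_EPOCH + timedelta(seconds=seconds)


print(sas_date_to_date(0))              # the epoch itself
print(sas_date_to_date(366))            # 1960 was a leap year
print(sas_datetime_to_datetime(86400))  # one day after the epoch
```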

In case formats are missing, you can check if the data in a column is a string, in which case the type in SAS was character. Anything else was numeric:

import pyreadstat

df, meta = pyreadstat.read_xport('file.xpt')

sas_types = dict()
for colname, coltype in df.dtypes.items():
    if coltype == object:
        # object columns hold Python objects; look at the first non-missing value
        nonan = df[colname].dropna()
        if not nonan.empty:
            # use .iloc[0]: after dropna() the index label 0 may no longer exist
            if isinstance(nonan.iloc[0], str):
                sas_types[colname] = 'character'
            else:
                sas_types[colname] = 'numeric'
        else:
            # every value is missing, so the type cannot be inferred from the data
            sas_types[colname] = '?'
    else:
        sas_types[colname] = 'numeric'

EDIT:

Since pyreadstat version 1.1.0 there is also meta.readstat_variable_types. This is a dictionary with the variable name as key and, as value, the binary type that Readstat extracted from the file. For SAS and SPSS you may get either 'string' (character) or 'double' (numeric). For Stata you may also get 'int8', 'int32' and 'float'.
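
A short sketch of how that dictionary could be mapped back to the two SAS types. The helper name is hypothetical and the sample dictionary is made up to mirror a few columns from the question; with a real file you would pass meta.readstat_variable_types.

```python
def sas_types_from_readstat(readstat_types):
    """Map Readstat binary types to 'character' or 'numeric'."""
    # 'string' means character; any other binary type is treated as numeric.
    return {
        name: "character" if rtype == "string" else "numeric"
        for name, rtype in readstat_types.items()
    }


# Hand-written stand-in for meta.readstat_variable_types.
readstat_types = {"TRIAL_NAME": "string", "VISITID": "double", "SEQ": "double"}
sas_types_1_1 = sas_types_from_readstat(readstat_types)
print(sas_types_1_1)
```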

Upvotes: 1
