Tony Beans
Tony Beans

Reputation: 195

SAS INFILE is creating missing values for some columns

I'm experiencing some very strange behaviour. I have a CSV data set with lots of columns (3199) and 148 rows/observations. Only some of the variables are of interest, but I've parsed out and manipulated what I need only to find a many of the variables have values that are all missing. Looking in the CSV directly (where I can) and looking at the data into R confirms the data are NOT missing.

My DATA step looks like this:

 DATA WORK.MYFILE;
  INFILE '\\[SERVER NAME]\[FOLDERS]\RawData.csv'
  lrecl=32760
  DSD
  TRUNCOVER
  FIRSTOBS=3  ;
INPUT
    F1               : $ CHAR6.
    ACAAAE31RR       : ?? BEST32.
    ACAAAE32RR       : ?? BEST32.
    ACAAAE33RR       : ?? BEST32.
    ACAAAE3BRR       : ?? BEST32.

 /* lots and lots of lines like this */

    SHAW5564TT3R     : ?? BEST32.
    SHAW6599TT3R     : ?? BEST32.
    SHAX0099TTAR     : ?? BEST32. ;
 RUN;

Everything seemed fine until I pulled out the data I wanted and realized this forced some missing variables. My guess is that after some column (I don't know which) there's only missing values all the way to the end of the observation row, and this maybe turns later columns to missing?? (Clearly I'm new to SAS).

BACKGROUND

In case it's relevant.

1) The data has many missing observations... most of them in the first several dozen rows. (The data are time series starting in 1980Q1 and the variables I want are missing before 2000Q1.)

2) I'm on SAS EnterpriseGuide. I got the above code by using the IMPORT wizard, copying the code into a program, and then removing the lines of formatting code above INFILE and changing the path to the actual CSV rather than the temporary one SAS creates with the wizard. When I did this, I did a find and replace of all the $CHAR1. and other misspecified informats and changed everything to BEST32.

3) My hunch about forced-to-missing-ness after some column is because all the variables are listed alphabetically. The variables I want look like:ACAINDIRR, AEDINDIRR, BVAINDIRR, BVIINDIRR, MWIINDIRR, OHAINDIRR, OOHINDIRR. The first three variables each have 104 non-missing observations as expected. The fourth (BVI...) has 43 non-missing observations and the rest have zero non-missing observations.

4) I've tried a bunch of trouble-shoot things to fixed this. The only thing that got close was converting the CSV to at *.txt file. The fourth variable, BVIINDIRR, then had 104 non-missing observations, oddly. But, the alphabetically later ones all still have 0 non-missing obs.

Any help is greatly appreciated!!

EDIT

Removing the ?? from the INPUT commands doesn't result in any errors or warnings. I do not get any warnings about characters being converted to numeric. The only notes in the Log are:

  NOTE: The infile  '\\[SERVER NAME]\[FOLDERS]\RawData.csv'is:
  Filename= \\[SERVER NAME]\[FOLDERS]\RawData.csv,
  RECFM=V,LRECL=32760,File Size (bytes)=23765791,
  Last Modified=03Aug2016:08:50:21,
  Create Time=03Aug2016:10:52:33

NOTE: 148 records were read from the infile  '\\[SERVER NAME]\[FOLDERS]\RawData.csv'.
  The minimum record length was 32294.
  The maximum record length was 32760.
  One or more lines were truncated.
NOTE: The data set WORK.PARETO has 148 observations and 31998 variables.
NOTE: MVA_DSIO.OPEN_CLOSE| _DISARM|         STOP| _DISARM| 2016-08-05T08:47:01,746-04:00| _DISARM| WorkspaceServer| _DISARM| SAS| 
  _DISARM| | _DISARM| 148| _DISARM| 28418048| _DISARM| 10| _DISARM| 11| _DISARM| 66861615| _DISARM| 569154881| _DISARM| 
  0.296875| _DISARM| 0.672000| _DISARM| 1786020421.075000| _DISARM| 1786020421.747000| _DISARM| 0.281250| _DISARM| | _ENDDISARM 
NOTE: PROCEDURE| _DISARM|         STOP| _DISARM| 2016-08-05T08:47:01,746-04:00| _DISARM| WorkspaceServer| _DISARM| SAS| _DISARM| | 
  _DISARM| 682061824| _DISARM| 26718208| _DISARM| 10| _DISARM| 11| _DISARM| 72237092| _DISARM| 569155157| _DISARM| 4.750000| 
  _DISARM| 7.937000| _DISARM| 1786020413.810000| _DISARM| 1786020421.747000| _DISARM| 2.453125| _DISARM| | _ENDDISARM 
 NOTE: DATA statement used (Total process time):
  real time           7.93 seconds
                        08:45 Friday, August 5, 2016

  cpu time            4.75 seconds

Upvotes: 0

Views: 1627

Answers (2)

Joe
Joe

Reputation: 63434

Your LRECL is clearly insufficient. Assuming a non-UTF codepage file, your file is 23 million bytes but only 148 rows. That is an average of around 160000 characters per line. LRECL=32767 is not going to cut that.

Increase your LRECL (as suggested in comments, lrecl=1M is probably reasonable) and see if that solves your problem, or if you have other issues you need to solve with the specific character variables.

Upvotes: 1

Christopher Anderson
Christopher Anderson

Reputation: 176

Try reading the data from the missing columns in as character variables, using an informat such as $CHAR200 (something with a long length so you're less likely to be truncating it - with $CHAR1 everything after the first character won't show up). My guess is that all of your columns that are missing are character data, or at least have a character in them, while SAS is busy looking for something numeric - because you're telling it to when you specify best32.

Does the log happen to give you a note about character values being converted to numeric values? If not, what notes go along with the datastep you reference?

Upvotes: 1

Related Questions