Reputation: 195
I'm experiencing some very strange behaviour. I have a CSV data set with lots of columns (3199) and 148 rows/observations. Only some of the variables are of interest, but I've parsed out and manipulated what I need only to find a many of the variables have values that are all missing. Looking in the CSV directly (where I can) and looking at the data into R confirms the data are NOT missing.
My DATA step looks like this:
DATA WORK.MYFILE;
INFILE '\\[SERVER NAME]\[FOLDERS]\RawData.csv'
lrecl=32760
DSD
TRUNCOVER
FIRSTOBS=3 ;
INPUT
F1 : $ CHAR6.
ACAAAE31RR : ?? BEST32.
ACAAAE32RR : ?? BEST32.
ACAAAE33RR : ?? BEST32.
ACAAAE3BRR : ?? BEST32.
/* lots and lots of lines like this */
SHAW5564TT3R : ?? BEST32.
SHAW6599TT3R : ?? BEST32.
SHAX0099TTAR : ?? BEST32. ;
RUN;
Everything seemed fine until I pulled out the data I wanted and realized this forced some missing variables. My guess is that after some column (I don't know which) there's only missing values all the way to the end of the observation row, and this maybe turns later columns to missing?? (Clearly I'm new to SAS).
BACKGROUND
In case it's relevant.
1) The data has many missing observations... most of them in the first several dozen rows. (The data are time series starting in 1980Q1 and the variables I want are missing before 2000Q1.)
2) I'm on SAS EnterpriseGuide. I got the above code by using the IMPORT wizard, copying the code into a program, and then removing the lines of formatting code above INFILE
and changing the path to the actual CSV rather than the temporary one SAS creates with the wizard. When I did this, I did a find and replace of all the $CHAR1.
and other misspecified informats
and changed everything to BEST32.
3) My hunch about forced-to-missing-ness after some column is because all the variables are listed alphabetically. The variables I want look like:ACAINDIRR
, AEDINDIRR
, BVAINDIRR
, BVIINDIRR
, MWIINDIRR
, OHAINDIRR
, OOHINDIRR
. The first three variables each have 104 non-missing observations as expected. The fourth (BVI...
) has 43 non-missing observations and the rest have zero non-missing observations.
4) I've tried a bunch of trouble-shoot things to fixed this. The only thing that got close was converting the CSV to at *.txt file. The fourth variable, BVIINDIRR
, then had 104 non-missing observations, oddly. But, the alphabetically later ones all still have 0 non-missing obs.
Any help is greatly appreciated!!
Removing the ??
from the INPUT
commands doesn't result in any errors or warnings. I do not get any warnings about characters being converted to numeric. The only notes in the Log are:
NOTE: The infile '\\[SERVER NAME]\[FOLDERS]\RawData.csv'is:
Filename= \\[SERVER NAME]\[FOLDERS]\RawData.csv,
RECFM=V,LRECL=32760,File Size (bytes)=23765791,
Last Modified=03Aug2016:08:50:21,
Create Time=03Aug2016:10:52:33
NOTE: 148 records were read from the infile '\\[SERVER NAME]\[FOLDERS]\RawData.csv'.
The minimum record length was 32294.
The maximum record length was 32760.
One or more lines were truncated.
NOTE: The data set WORK.PARETO has 148 observations and 31998 variables.
NOTE: MVA_DSIO.OPEN_CLOSE| _DISARM| STOP| _DISARM| 2016-08-05T08:47:01,746-04:00| _DISARM| WorkspaceServer| _DISARM| SAS|
_DISARM| | _DISARM| 148| _DISARM| 28418048| _DISARM| 10| _DISARM| 11| _DISARM| 66861615| _DISARM| 569154881| _DISARM|
0.296875| _DISARM| 0.672000| _DISARM| 1786020421.075000| _DISARM| 1786020421.747000| _DISARM| 0.281250| _DISARM| | _ENDDISARM
NOTE: PROCEDURE| _DISARM| STOP| _DISARM| 2016-08-05T08:47:01,746-04:00| _DISARM| WorkspaceServer| _DISARM| SAS| _DISARM| |
_DISARM| 682061824| _DISARM| 26718208| _DISARM| 10| _DISARM| 11| _DISARM| 72237092| _DISARM| 569155157| _DISARM| 4.750000|
_DISARM| 7.937000| _DISARM| 1786020413.810000| _DISARM| 1786020421.747000| _DISARM| 2.453125| _DISARM| | _ENDDISARM
NOTE: DATA statement used (Total process time):
real time 7.93 seconds
08:45 Friday, August 5, 2016
cpu time 4.75 seconds
Upvotes: 0
Views: 1627
Reputation: 63434
Your LRECL is clearly insufficient. Assuming a non-UTF codepage file, your file is 23 million bytes but only 148 rows. That is an average of around 160000 characters per line. LRECL=32767 is not going to cut that.
Increase your LRECL (as suggested in comments, lrecl=1M
is probably reasonable) and see if that solves your problem, or if you have other issues you need to solve with the specific character variables.
Upvotes: 1
Reputation: 176
Try reading the data from the missing columns in as character variables, using an informat such as $CHAR200 (something with a long length so you're less likely to be truncating it - with $CHAR1 everything after the first character won't show up). My guess is that all of your columns that are missing are character data, or at least have a character in them, while SAS is busy looking for something numeric - because you're telling it to when you specify best32.
Does the log happen to give you a note about character values being converted to numeric values? If not, what notes go along with the datastep you reference?
Upvotes: 1