Reputation: 1278
I have a csv file which has user entered comments, that occasionally end in newlines. This causes errors when using infile
, and results in rows with mis-ordered variables.
I've tried termstr=crlf
or cr
or lf
. With lf
or crlf
I get zero observations read in, and with cr
, I get the same error as if I leave it out.
If I open the file in excel and do a Replace ctrl-j (e.g. http://blog.contextures.com/archives/2013/05/28/find-and-replace-line-breaks-in-excel/), it solves the problem and the data reads in perfectly.
The file is coming from backendless, and the original data is from smart phones.
edit: I solved this using http://support.sas.com/kb/41/116.html as a source, and ended up with the following
data _null_;
infile 'test.csv' recfm=n lrecl=50000;
file 'testout.csv' recfm=n;
input a $char1.;
if a = '"' then do;
d+1;
if d = 2 then d = 0;
end;
if a = "," and d = 0 then do;
c+1;
end;
if a = '0A'x then do;
if c = 2 then do;
c = 0;
put '0A'x;
end;
end;
else put a $char1.;
run;
The first conditional tracks whether the delimiters I'm seeing (,
) are true delimiters or just inside a string, the second conditional counts the number of variables I've seen so far. The last conditional deletes all lf
's except for after I've seen the proper number of variables (c=2
).
Upvotes: 0
Views: 7332
Reputation: 63434
If your data have normal CRLF termination strings, this will work. This was used to read in a file created in excel with 3 columns, and alt+enter in the middle of strings in a few places.
data test;
infile "c:\temp\newlines.csv" termstr=crlf dlm=',' dsd;
format stuff stuff2 stuff3 $100.;
input
stuff $
stuff2 $
stuff3 $
;
run;
If that ends up with only stuff
having a value (of your first 100 characters, or whatnot), then you probably do not have CRLF in your data. If it's coming from a third party service (as yours seems to be), the first place I'd go is to that service or application, and see what options you have there.
You can always see what's coming in using recfmt=V
and looking at the HEX output.
data test2;
infile "c:\temp\newlines.csv" recfm=f ls=20 end=eof flowover;
format indata $20.;
do until (eof);
input @1 indata $20.;
put indata= hex.;
put indata=;
end;
stop;
run;
That gives you 20 characters at a time, the hex and the normal ASCII representation. CR is "0D", LF is "0A". So find your first normal end of line [look at it visually, find what should be in the last column], and see if there are 0D0A after it. If not, then you don't have CRLF terminators, which is a problem.
Note, when you open it in Excel, excel undoubtedly adds them for you, so that isn't going to help solve this problem - you have to look at the file raw.
For example, the above with my junk data file generates for the first 2 input passes:
indata=73747566662C224D6F72650A5374756666222C4D
indata=stuff,"More Stuff",M
indata=792073747566660D0A6F6E652C74776F2C746872
indata=y stuff one,two,thr
In the first line, there's an 0A between "More" and "Stuff". [More is 4D6F7265, Stuff is 5374756666]. That's all on one line, the original CSV is
stuff, "More Stuff", My stuff
one,two,three
except with a linefeed instead of the space in "More Stuff".
On the second line, there is 0D0A after "stuff" (7374756666, same as above but lower case s is 73 not 53). That's CR+LF, the normal line terminator.
If you don't have line terminators in your original file, you may need to read it in with a variable input. You can have dlm=','
but recfm=v
, which allows you to just read in a single delimited bit at a time.
data test3;
infile "c:\temp\newlines.csv" recfm=v dlm=',' dsd end=eof flowover termstr=crlf;
format stuff stuff2 stuff3 $100.;
input stuff $ @;
input stuff2 $ @;
input stuff3 $ @;
run;
That uses recfm=v
basically to not worry about the lack of end of line terminators. There are many other solutions to this problem (recfm=n
and reading stream input, for example, is another good one). Search Google/etc. for more details once you know more about your particular problem.
Upvotes: 1
Reputation: 83
You could try using the missover or truncover infile options, in case SAS reading past the end of the line when the newline character is missing is causing your errors.
You could also just compress out the newline character from the buffer automatic variable after your infile statement and before your input statement. For example:
data test;
infile "myfile.csv";
* Hold the current line being read;
input @;
* Compress out the LF -- you could also try this with CR and CRLF;
_infile_ = compress(_infile_,'0A'x);
* Your input statement should now read the cleaned up buffer;
input myvar ;
run;
Upvotes: 0