Reputation: 603
I am importing a flat .txt file that uses both row and column delimiters. The problem is that the row delimiter is used to reduce file size, so the remaining columns in a row are often skipped. Another problem is that the length of the longest character value is unknown, so if a long string is truncated we lose the delimiter and the whole structure falls apart.
An explicit example of the problems I am facing:
.txt file
Var1'~'Var2'~'Var3'~'Var4'~'Var5'~'Var6'#@#@'
1'~''#@#@'
This is going to be a really long string as an example of a situation where the long string is very large and so the truncated string does not indicate a delimiter and we lose data '#@#@'
1'~' 2'~' 3'~' 4'~' 5'~' 6'#@#@'
1'~' 2'~' 3'~''#@#@'
I am having a lot of problems importing this data, for a number of reasons:
Setting a very large length for the character variables makes the import very time consuming, and since we do not know the length of the longest character variable, each iteration takes even more time.
I have not yet found a way to handle both the column and row delimiters when the structure means the next row can be signalled before all the columns have been filled in, i.e. I can't just add an extra column for the row delimiter and drop it.
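To make the structure concrete, here is the parsing logic I am after, sketched in plain Python (illustration only, not my SAS attempt; the delimiter strings and sample rows are taken from the file excerpt above):

```python
# Rows end with '#@#@' and columns are separated by '~', each delimiter
# wrapped in literal single quotes. A row delimiter may appear before
# all six columns are present, so short rows must be padded.

sample = (
    "Var1'~'Var2'~'Var3'~'Var4'~'Var5'~'Var6'#@#@'"
    "1'~''#@#@'"
    "1'~' 2'~' 3'~' 4'~' 5'~' 6'#@#@'"
)

ROW_DELIM = "'#@#@'"
COL_DELIM = "'~'"
N_COLS = 6

records = []
for row in sample.split(ROW_DELIM):
    if not row:
        continue  # empty piece after the final row delimiter
    fields = row.split(COL_DELIM)
    fields += [""] * (N_COLS - len(fields))  # pad the skipped columns
    records.append(fields)

print(records[0])  # header row, all six columns present
print(records[1])  # short row, padded out to six columns
```

The key point is that splitting on the row delimiter has to happen first; splitting on the column delimiter alone cannot tell a skipped column from a row boundary.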
SAS code I have tried:
data want;
  infile "file-location" dlmstr="#@#@" dsd recfm=F lrecl=10000000000;
  informat var $200.;
  input var $ @@;
run;
Any experience and insight is greatly appreciated.
Upvotes: 3
Views: 291
Reputation: 63424
If you have one record per line, then just use the missover or truncover option on infile; that tells SAS to stop reading when it hits the end of the line. You'll only have to deal with the odd end delimiter insofar as it will cause an error if you try to read it into a numeric variable; you could remove that string first in preprocessing.
data want;
  infile "yourfile-location" dlmstr="'~'" dsd lrecl=32767 truncover;
  input @;
  _infile_ = tranwrd(_infile_, "'#@#@'", " ");
  input var1 var2 var3 var4 var5 var6;
run;
The way you wrote it could also work if the above doesn't for some reason: basically, read the string in twice with two different delimiter options, once with dlmstr="'#@#@'" and then with dlmstr="'~'". Or, rather than reading it in twice, read it in once with the first and then parse it with the second.
data want;
  infile "yourfile-location" dlmstr="'#@#@'" dsd lrecl=32767;
  input @;
  array var[6] var1-var6;
  do _i = 1 to countc(_infile_, "~") + 1;
    var[_i] = scan(_infile_, _i, "~");
  end;
  drop _i;
run;
The above isn't perfect, as it doesn't deal with the quotation marks around the delimiters, but you can figure that out depending on the details - is that quotation mark safe to just compress out entirely pre-input, or do you need to do some fancy footwork with SUBSTR?
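If the quotes are indeed safe to remove, one option is a preprocessing pass outside SAS that rewrites the quoted delimiters as plain ones, so the cleaned file can be read with an ordinary dlm='~' and truncover. A rough Python sketch (file names are placeholders; this assumes one record per line, as above):

```python
# Rewrite the quoted delimiters as plain ones and drop the row
# terminator, so the cleaned file reads with dlm='~' in SAS.
# Only the exact delimiter sequences are touched, so quotes inside
# field values are left alone.
def clean_line(raw: str) -> str:
    return raw.replace("'~'", "~").replace("'#@#@'", "")

def clean_file(src: str, dst: str) -> None:
    # src/dst are placeholder paths, not from the question.
    with open(src) as fin, open(dst, "w") as fout:
        for raw in fin:
            fout.write(clean_line(raw.rstrip("\n")) + "\n")

print(clean_line("1'~' 2'~' 3'~''#@#@'"))  # -> 1~ 2~ 3~
```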
As far as the character variable length goes, most likely what's taking time is writing out the file. Use options compress=char; to turn on dataset compression, assuming your ultimate use for these files is compatible with that (if you're just running SAS code on them, it should be). Then SAS won't write out the full variable length. If that fails, you may need to reconsider your dataset structure to avoid the issue - though you'd need to ask a separate question with a lot more detail to find the best solution there.
Upvotes: 1