split() on tabs may not split on every tab

Question

I have a file of one line:

 $ od -c testData.txt 
 0000000    6   7   7   7   1   0  \t   0  \t   1  \t   L   P   A   Y  \t
 0000020    F   6   3   5   P   3   B  \t   L   P   A   Y   0   0   0   0
 0000040    1  \t   F   R   M  \t   H   O   U   S   T   O   N       G   R
 0000060    O   U   P       (   a   k   a       C   O   R   P   O   R   A
 0000100    T   E       A   D   V   O   C   A   T   E   S       I   N   C
 0000120    .   )       T   H   E  \t  \t  \t  \t   S   a   c   r   a   m
 0000140    e   n   t   o  \t   C   A  \t   9   5   8   1   4   -   2   8
 0000160    2   5  \t   (   9   1   6   )       4   4   7   -   9   8   8
 0000200    4  \t  \t   6   4   9   9   .   9   8  \t   1   7   .   1   9
 0000220   \t   0  \t  \t   6   5   1   7   .   1   7  \t   3   9   3   0
 0000240    9   .   2   3  \t   N  \t  \t  \t  \r  \n
 0000253

I have a script that does one thing:

 #!/usr/bin/perl
 $line = <STDIN>;
 @p = split '\t', $line;
 chomp(@p);
 for ($idx = 0; $idx < scalar(@p); $idx++) { print $idx.": \"".$p[$idx]."\"\n"; }
 exit(0);

I am on Mac OS X 10.8.5 and using the stock perl (perl 5, version 12, subversion 4 (v5.12.4) built for darwin-thread-multi-2level).

If I do not pipe the data through col then I see a glitch from the line-ending. If I do then the split() function will ignore a few tabs. Not all, just a few. Really. Annoying.

 $ ./testSplit < testData.txt 
 0: "677710"
 1: "0"
 2: "1"
 3: "LPAY"
 4: "F635P3B"
 5: "LPAY00001"
 6: "FRM"
 7: "HOUSTON GROUP (aka CORPORATE ADVOCATES INC.) THE"
 8: ""
 9: ""
 10: ""
 11: "Sacramento"
 12: "CA"
 13: "95814-2825"
 14: "(916) 447-9884"
 15: ""
 16: "6499.98"
 17: "17.19"
 18: "0"
 19: ""
 20: "6517.17"
 21: "39309.23"
 22: "N"
 23: ""
 24: ""
 "5: "
 $

See slight glitch at last line above.

 $ col < testData.txt | ./testSplit 
 0: "677710"
 1: "0"
 2: "1"
 3: "LPAY"
 4: "F635P3B LPAY00001"
 5: "FRM"
 6: "HOUSTON GROUP (aka CORPORATE ADVOCATES INC.) THE"
 7: ""
 8: ""
 9: ""
 10: "Sacramento"
 11: "CA"
 12: "95814-2825"
 13: "(916) 447-9884"
 14: ""
 15: "6499.98 17.19"
 16: "0"
 17: ""
 18: "6517.17 39309.23"
 19: "N"
 $

What the heck!

cjm · Accepted Answer

Actually, it's col that's ignoring the tabs (it's converting some of them to spaces):

$ diff -u <(od -c testData.txt) <(col
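
The same check can be made from Perl by counting tabs. This is only a sketch: the helper name `count_tabs` is made up, and the two strings are fragments modelled on field 4 of the two runs in the question, not the real file contents.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the tab characters in a string.
sub count_tabs {
    my ($s) = @_;
    return ($s =~ tr/\t//);   # tr with an empty replacement list only counts
}

# Fragments modelled on the two runs: raw input vs. what col passed on.
my $raw       = "F635P3B\tLPAY00001\tFRM";
my $after_col = "F635P3B LPAY00001\tFRM";

print "raw:       ", count_tabs($raw), " tab(s)\n";
print "after col: ", count_tabs($after_col), " tab(s)\n";
```

If the second count is lower, the tabs were lost before split() ever saw the line.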



To fix your actual problem, you need to remove the \r character.  chomp doesn't do that.  For field 25, you're essentially doing print qq{25: "\r"\n}.  The \r moves the cursor back to the left margin, causing the " to overwrite the 2.
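
The effect is easy to reproduce in isolation. The following is a minimal sketch, assuming a hard-coded sample string in place of the real file; the helper `visible` is made up for illustration. It shows that chomp strips only the trailing newline, so the carriage return stays in the last field, and that removing an optional \r together with the \n is one possible way around it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: render CR, LF and TAB as visible escapes.
sub visible {
    my ($s) = @_;
    my %esc = ("\r" => '\r', "\n" => '\n', "\t" => '\t');
    $s =~ s/([\r\n\t])/$esc{$1}/g;
    return $s;
}

# Sample line with a CRLF ending, like the tail of testData.txt.
my $line = "N\t\t\t\r\n";

my @p = split /\t/, $line;
chomp(@p);    # removes "\n" only, because $/ defaults to "\n"

print scalar(@p), " fields, last one is \"", visible($p[-1]), "\"\n";

# Stripping an optional CR together with the LF avoids the glitch.
(my $clean = $line) =~ s/\r?\n\z//;
print "cleaned line: \"", visible($clean), "\"\n";
```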

Here's a cleaned-up version:

#!/usr/bin/perl
use strict;
use warnings;

binmode STDIN, ':crlf';

my $line = <STDIN>;
chomp($line);
my @p = split /\t/, $line, -1;
for my $idx (0 .. $#p) { print $idx.": \"".$p[$idx]."\"\n"; }
exit(0);


Major changes:


binmode STDIN, ':crlf' turns on CRLF->LF translation when reading.  This gets rid of the \r.
Chomp the line, not the individual parts.  This isn't fatal, because chomp only removes the line-ending character, but it's a waste of time to chomp all the elements of @p when what you really wanted was chomp $line.
Adding -1 to split.  This keeps the empty fields at the end.  Otherwise, the output would stop with field 22.  (The empty fields used to be displayed because the trailing \r meant the last one wasn't empty.)
Changing the for loop to use 0 .. $#p isn't necessary; it's just simpler.
Using strict and warnings is always a good idea.  This required inserting a number of my statements.
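
The effect of the -1 limit can be seen in a short sketch. The sample string is made up (one non-empty field followed by three tabs, like the end of the line after the \r\n is gone), not read from the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: one non-empty field followed by three tabs.
my $line = "N\t\t\t";

my @default = split /\t/, $line;       # trailing empty fields are dropped
my @keep    = split /\t/, $line, -1;   # a negative limit keeps them

print "default:  ", scalar(@default), " field(s)\n";
print "limit -1: ", scalar(@keep), " field(s)\n";
```

Without the limit only "N" comes back; with -1 the three empty fields after it are preserved.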
