Ray Kiddy

Reputation: 3621

split() on tabs may not split on every tab

I have a file of one line:

 $ od -c testData.txt 
 0000000    6   7   7   7   1   0  \t   0  \t   1  \t   L   P   A   Y  \t
 0000020    F   6   3   5   P   3   B  \t   L   P   A   Y   0   0   0   0
 0000040    1  \t   F   R   M  \t   H   O   U   S   T   O   N       G   R
 0000060    O   U   P       (   a   k   a       C   O   R   P   O   R   A
 0000100    T   E       A   D   V   O   C   A   T   E   S       I   N   C
 0000120    .   )       T   H   E  \t  \t  \t  \t   S   a   c   r   a   m
 0000140    e   n   t   o  \t   C   A  \t   9   5   8   1   4   -   2   8
 0000160    2   5  \t   (   9   1   6   )       4   4   7   -   9   8   8
 0000200    4  \t  \t   6   4   9   9   .   9   8  \t   1   7   .   1   9
 0000220   \t   0  \t  \t   6   5   1   7   .   1   7  \t   3   9   3   0
 0000240    9   .   2   3  \t   N  \t  \t  \t  \r  \n                    
 0000253

I have a script that does just one thing:

 #!/usr/bin/perl
 $line = <STDIN>;
 @p = split '\t', $line;
 chomp(@p);
 for ($idx = 0; $idx < scalar(@p); $idx++) { print $idx.": \"".$p[$idx]."\"\n"; }
 exit(0);

I am on Mac OS X 10.8.5 and using the stock perl (perl 5, version 12, subversion 4 (v5.12.4) built for darwin-thread-multi-2level).

If I do not pipe the data through col, I see a glitch caused by the line ending. If I do, split() seems to ignore a few of the tabs. Not all, just a few. Really annoying.

 $ ./testSplit < testData.txt 
 0: "677710"
 1: "0"
 2: "1"
 3: "LPAY"
 4: "F635P3B"
 5: "LPAY00001"
 6: "FRM"
 7: "HOUSTON GROUP (aka CORPORATE ADVOCATES INC.) THE"
 8: ""
 9: ""
 10: ""
 11: "Sacramento"
 12: "CA"
 13: "95814-2825"
 14: "(916) 447-9884"
 15: ""
 16: "6499.98"
 17: "17.19"
 18: "0"
 19: ""
 20: "6517.17"
 21: "39309.23"
 22: "N"
 23: ""
 24: ""
 "5: "
 $

Note the slight glitch in the last line above.

 $ col < testData.txt | ./testSplit 
 0: "677710"
 1: "0"
 2: "1"
 3: "LPAY"
 4: "F635P3B LPAY00001"
 5: "FRM"
 6: "HOUSTON GROUP (aka CORPORATE ADVOCATES INC.) THE"
 7: ""
 8: ""
 9: ""
 10: "Sacramento"
 11: "CA"
 12: "95814-2825"
 13: "(916) 447-9884"
 14: ""
 15: "6499.98 17.19"
 16: "0"
 17: ""
 18: "6517.17 39309.23"
 19: "N"
 $

What the heck!

Upvotes: 0

Views: 138

Answers (1)

cjm

Reputation: 62109

Actually, it's col that's ignoring the tabs (it's converting some of them to spaces):

$ diff -u <(od -c testData.txt) <(col <testData.txt | od -c)
--- /dev/fd/63  2013-11-10 00:06:29.532490383 -0600
+++ /dev/fd/62  2013-11-10 00:06:29.532490383 -0600
@@ -1,12 +1,12 @@
 0000000   6   7   7   7   1   0  \t   0  \t   1  \t   L   P   A   Y  \t
-0000020   F   6   3   5   P   3   B  \t   L   P   A   Y   0   0   0   0
+0000020   F   6   3   5   P   3   B       L   P   A   Y   0   0   0   0
 0000040   1  \t   F   R   M  \t   H   O   U   S   T   O   N       G   R
 0000060   O   U   P       (   a   k   a       C   O   R   P   O   R   A
 0000100   T   E       A   D   V   O   C   A   T   E   S       I   N   C
 0000120   .   )       T   H   E  \t  \t  \t  \t   S   a   c   r   a   m
 0000140   e   n   t   o  \t   C   A  \t   9   5   8   1   4   -   2   8
 0000160   2   5  \t   (   9   1   6   )       4   4   7   -   9   8   8
-0000200   4  \t  \t   6   4   9   9   .   9   8  \t   1   7   .   1   9
+0000200   4  \t  \t   6   4   9   9   .   9   8       1   7   .   1   9
-0000220  \t   0  \t  \t   6   5   1   7   .   1   7  \t   3   9   3   0
+0000220  \t   0  \t  \t   6   5   1   7   .   1   7       3   9   3   0
-0000240   9   .   2   3  \t   N  \t  \t  \t  \r  \n
+0000240   9   .   2   3  \t   N  \n
-0000253
+0000247

To fix your actual problem, you need to remove the \r character. chomp doesn't do that: with the default $/ it strips only the trailing \n, so the \r stays on the end of the last field. For field 25, you're essentially doing print qq{25: "\r"\n}. The \r moves the cursor back to the left margin, causing the " to overwrite the 2.
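
If you want to see this for yourself, here is a minimal sketch. It assumes the default $/ and no :crlf layer, and the show() helper is a hypothetical name I made up just to make control characters visible. The sample string is the tail of your line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any library): render CR, LF and TAB
# as visible escape sequences so they can be inspected on a terminal.
sub show {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    $s =~ s/\t/\\t/g;
    return $s;
}

my $line = "N\t\t\t\r\n";      # tail of the sample line, CRLF-terminated
chomp(my $chomped = $line);    # with the default $/, only "\n" is removed

print show($chomped), "\n";    # prints N\t\t\t\r  (the \r is still there)
```

So after chomp the last field is a lone \r rather than an empty string.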

Here's a cleaned up version:

#!/usr/bin/perl
use strict;
use warnings;

binmode STDIN, ':crlf';

my $line = <STDIN>;
chomp($line);
my @p = split /\t/, $line, -1;
for my $idx (0 .. $#p) { print $idx.": \"".$p[$idx]."\"\n"; }
exit(0);

Major changes:

  1. binmode STDIN, ':crlf' turns on CRLF->LF translation when reading. This gets rid of the \r.
  2. Chomp the line, not the individual parts. This isn't fatal, because chomp only removes the line-ending character, but it's a waste of time to chomp all the elements of @p when what you really wanted was chomp $line.
  3. Adding -1 to split. This keeps the empty fields at the end. Otherwise, the output would stop with field 22. (The empty fields used to be displayed because the trailing \r meant the last one wasn't empty.)
  4. Changing the for loop to use 0 .. $#p isn't necessary; it's just simpler.
  5. Adding use strict and use warnings is always a good idea. This required declaring the variables with my.
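
To illustrate point 3, here is a small sketch on a made-up line (the data is hypothetical, not your file). It also strips the line ending with a substitution instead of binmode; that is just one alternative I'm assuming would suit you if you can't change the layer on the filehandle, not something the script above needs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "a\t\tb\t\t\t\r\n";            # made-up sample: 5 tabs, CRLF ending

# Alternative to binmode ':crlf' (an assumption, not required):
# strip either LF or CRLF from the end of the string.
(my $clean = $line) =~ s/\r?\n\z//;

my @default = split /\t/, $clean;         # trailing empty fields are dropped
my @all     = split /\t/, $clean, -1;     # negative limit keeps them

# With only chomp, the stray \r makes the last field non-empty,
# so nothing is dropped even without the -1.
chomp(my $chomped = $line);
my @with_cr = split /\t/, $chomped;

print scalar(@default), "\n";             # 3
print scalar(@all), "\n";                 # 6
print scalar(@with_cr), "\n";             # 6
```

The third case is why your original script showed the empty fields at the end: the last field was "\r", not "".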

Upvotes: 6
