Reputation: 3621
I have a file of one line:
$ od -c testData.txt
0000000   6   7   7   7   1   0  \t   0  \t   1  \t   L   P   A   Y  \t
0000020   F   6   3   5   P   3   B  \t   L   P   A   Y   0   0   0   0
0000040   1  \t   F   R   M  \t   H   O   U   S   T   O   N       G   R
0000060   O   U   P       (   a   k   a       C   O   R   P   O   R   A
0000100   T   E       A   D   V   O   C   A   T   E   S       I   N   C
0000120   .   )       T   H   E  \t  \t  \t  \t   S   a   c   r   a   m
0000140   e   n   t   o  \t   C   A  \t   9   5   8   1   4   -   2   8
0000160   2   5  \t   (   9   1   6   )       4   4   7   -   9   8   8
0000200   4  \t  \t   6   4   9   9   .   9   8  \t   1   7   .   1   9
0000220  \t   0  \t  \t   6   5   1   7   .   1   7  \t   3   9   3   0
0000240   9   .   2   3  \t   N  \t  \t  \t  \r  \n
0000253
I have a script that does just one thing:
#!/usr/bin/perl
$line = <STDIN>;
@p = split '\t', $line;
chomp(@p);
for ($idx = 0; $idx < scalar(@p); $idx++) { print $idx.": \"".$p[$idx]."\"\n"; }
exit(0);
I am on Mac OS X 10.8.5 and using the stock perl (perl 5, version 12, subversion 4 (v5.12.4) built for darwin-thread-multi-2level).
If I do not pipe the data through col then I see a glitch from the line-ending. If I do then the split() function will ignore a few tabs. Not all, just a few. Really. Annoying.
$ ./testSplit < testData.txt
0: "677710"
1: "0"
2: "1"
3: "LPAY"
4: "F635P3B"
5: "LPAY00001"
6: "FRM"
7: "HOUSTON GROUP (aka CORPORATE ADVOCATES INC.) THE"
8: ""
9: ""
10: ""
11: "Sacramento"
12: "CA"
13: "95814-2825"
14: "(916) 447-9884"
15: ""
16: "6499.98"
17: "17.19"
18: "0"
19: ""
20: "6517.17"
21: "39309.23"
22: "N"
23: ""
24: ""
"5: "
$
See slight glitch at last line above.
$ col < testData.txt | ./testSplit
0: "677710"
1: "0"
2: "1"
3: "LPAY"
4: "F635P3B LPAY00001"
5: "FRM"
6: "HOUSTON GROUP (aka CORPORATE ADVOCATES INC.) THE"
7: ""
8: ""
9: ""
10: "Sacramento"
11: "CA"
12: "95814-2825"
13: "(916) 447-9884"
14: ""
15: "6499.98 17.19"
16: "0"
17: ""
18: "6517.17 39309.23"
19: "N"
$
What the heck!
Upvotes: 0
Views: 138
Reputation: 62109
Actually, it's col that's ignoring the tabs (it's converting some of them to spaces):
$ diff -u <(od -c testData.txt) <(col <testData.txt | od -c)
--- /dev/fd/63 2013-11-10 00:06:29.532490383 -0600
+++ /dev/fd/62 2013-11-10 00:06:29.532490383 -0600
@@ -1,12 +1,12 @@
 0000000   6   7   7   7   1   0  \t   0  \t   1  \t   L   P   A   Y  \t
-0000020   F   6   3   5   P   3   B  \t   L   P   A   Y   0   0   0   0
+0000020   F   6   3   5   P   3   B       L   P   A   Y   0   0   0   0
 0000040   1  \t   F   R   M  \t   H   O   U   S   T   O   N       G   R
 0000060   O   U   P       (   a   k   a       C   O   R   P   O   R   A
 0000100   T   E       A   D   V   O   C   A   T   E   S       I   N   C
 0000120   .   )       T   H   E  \t  \t  \t  \t   S   a   c   r   a   m
 0000140   e   n   t   o  \t   C   A  \t   9   5   8   1   4   -   2   8
 0000160   2   5  \t   (   9   1   6   )       4   4   7   -   9   8   8
-0000200   4  \t  \t   6   4   9   9   .   9   8  \t   1   7   .   1   9
+0000200   4  \t  \t   6   4   9   9   .   9   8       1   7   .   1   9
-0000220  \t   0  \t  \t   6   5   1   7   .   1   7  \t   3   9   3   0
+0000220  \t   0  \t  \t   6   5   1   7   .   1   7       3   9   3   0
-0000240   9   .   2   3  \t   N  \t  \t  \t  \r  \n
+0000240   9   .   2   3  \t   N  \n
-0000253
+0000247
To fix your actual problem, you need to remove the \r character. chomp doesn't do that. For field 25, you're essentially doing print qq{25: "\r"\n}. The \r moves the cursor back to the left margin, causing the " to overwrite the 2.
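If you want to see this for yourself, here is a minimal sketch. The helper name show_eol and the sample string are made up for illustration; they are not part of the original script. It makes the carriage return visible, then compares chomp with a substitution that strips either kind of line ending:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string with CR and LF
# spelled out as the two-character sequences \r and \n.
sub show_eol {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

# What the last field holds when the line ends in CRLF.
my $field = "\r\n";

# chomp removes only the trailing "\n" (the default $/).
chomp(my $chomped = $field);

# This substitution removes "\r\n" or a bare "\n" at the end.
(my $stripped = $field) =~ s/\r?\n\z//;

print 'after chomp: "', show_eol($chomped),  "\"\n";   # after chomp: "\r"
print 'after regex: "', show_eol($stripped), "\"\n";   # after regex: ""
```

The substitution is just one way to drop the \r when you can't control how the filehandle is opened.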
Here's a cleaned up version:
#!/usr/bin/perl
use strict;
use warnings;
binmode STDIN, ':crlf';
my $line = <STDIN>;
chomp($line);
my @p = split /\t/, $line, -1;
for my $idx (0 .. $#p) { print $idx.": \"".$p[$idx]."\"\n"; }
exit(0);
Major changes:
- binmode STDIN, ':crlf' turns on CRLF->LF translation when reading. This gets rid of the \r.
- chomp only removes the line-ending character, but it's a waste of time to chomp all the elements of @p when what you really wanted was chomp $line.
- Added -1 as the third argument to split. This keeps the empty fields at the end. Otherwise, the output would stop with field 22. (The empty fields used to be displayed because the trailing \r meant the last one wasn't empty.)
- Changing the for loop to use 0 .. $#p isn't necessary; it's just simpler.
- Using strict and warnings is always a good idea. This required inserting a number of my statements.
Upvotes: 6
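The effect of the -1 limit on split is easy to check in isolation. A minimal sketch, using a made-up sample string that mimics the tail of the line once the CRLF is gone (the variable names are only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One non-empty field followed by three empty trailing fields.
my $line = "N\t\t\t";

# Without a limit, split drops trailing empty fields.
my @default = split /\t/, $line;

# A negative limit keeps every trailing empty field.
my @keep = split /\t/, $line, -1;

print scalar(@default), "\n";   # 1
print scalar(@keep),    "\n";   # 4
```

That is why the output would stop at field 22 without the -1: everything after the N is empty and gets discarded.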