Reputation: 2076
I have a huge file provided by a third party, which appears to have been generated in a Windows/DOS-like environment. The last line of the file contains a ^Z character. I noticed this when I looked at the processed file and the last line contained a ^Z. I added some logic to skip this line from the input, and it was working fine until I changed my code to take the input from stdin as opposed to a file.
Here is a simpler illustration of this issue. When I do a line count on a single file stream, with and without ^Z skipping, it reports the correct values:
unzip -j -p -qq file1.zip | perl -nle 'print' | wc -l
3451
unzip -j -p -qq file2.zip | perl -nle 'print' | wc -l
3451
unzip -j -p -qq file1.zip | perl -nle 'next if /^\cZ/; print' | wc -l
3450
unzip -j -p -qq file2.zip | perl -nle 'next if /^\cZ/; print' | wc -l
3450
Now when I try to process both files at once, I lose one record. I am guessing this has something to do with the ^Z character, but I cannot figure out what I can do about it:
unzip -j -p -qq '*.zip' | perl -nle 'print' | wc -l
6901 ## this should have been 6902
unzip -j -p -qq '*.zip' | perl -nle 'next if /^\cZ/; print' | wc -l
6899 ## this should have been 6900
These files are huge (20+ GB each) and they are to be read in groups of 3-6 files, so I wanted to avoid processing them one by one and concatenating the results later. Any thoughts on how to avoid the ^Z character without running into the above issue?
I am on a Linux machine. By the way, opening the file in vim does not display the last record (i.e., ^Z), and running set ff=unix did not change this either. So vim reports 3450 lines for the single unzipped file and 6900 for the combined unzipped files.
Thanks!
Upvotes: 2
Views: 366
Reputation: 386406
Since the ^Z isn't followed by a line ending, unzip is producing
file1:1
file1:2
file1:3
^Zfile2:1
file2:2
file2:3
^Z
so you delete the first line of the second file. You could simply remove the ^Z instead of the entire line.
perl -pe's/^\cZ//'
That said, unzip -a is designed for exactly this situation. Not only will it strip the ^Z for you, it will also fix the line endings if necessary.
$ unzip -j -p -qq z.zip a.txt | od -c
0000000 a b c \r \n d e f \r \n 032
0000013
$ unzip -j -p -qq z.zip b.txt | od -c
0000000 g h i \r \n j k l \r \n 032
0000013
$ unzip -j -p -qq z.zip | od -c
0000000 a b c \r \n d e f \r \n 032 g h i \r \n
0000020 j k l \r \n 032
0000026
$ unzip -j -p -qq -a z.zip | od -c
0000000 a b c \n d e f \n g h i \n j k l \n
0000020
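Where unzip -a is not an option (for example, on a stream that does not come from unzip), a blunter fallback that is not part of the approach above is to delete the bytes with tr. This is only a sketch: it assumes the data never contains a legitimate carriage return or ^Z byte, since tr removes them everywhere and not just at line ends or end of file. The sample and clean function names are made up for illustration.

```shell
# Same bytes as the combined z.zip output shown above, built with printf.
sample() {
  printf 'abc\r\ndef\r\n\032ghi\r\njkl\r\n\032'
}

# Delete every carriage return (\r) and ^Z (octal 032) byte in the stream.
clean() {
  tr -d '\r\032'
}

# Dump the cleaned stream so the remaining bytes can be inspected.
sample | clean | od -c
```

Because tr works as a streaming filter, it can sit in the same pipeline as unzip -p without writing intermediate files.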
Upvotes: 4