Ecognium

Reputation: 2076

Ctrl+Z Character and EOF Issues With Pipes

I have a huge file provided by a third party, which appears to have been generated in a Windows/DOS-like environment. The last line of the file consists of a ^Z character, which I noticed when it showed up as the last line of my processed output. I added some logic to skip this line from the input, and it worked fine until I changed my code to read from stdin instead of from a file.

Here is a simpler illustration of the issue. When I do a line count on a single file stream, with and without ^Z skipping, I get the correct values:

unzip -j -p -qq file1.zip | perl -nle 'print' | wc -l
3451
unzip -j -p -qq file2.zip | perl -nle 'print' | wc -l
3451

unzip -j -p -qq file1.zip | perl -nle 'next if /^\cZ/; print' | wc -l
3450
unzip -j -p -qq file2.zip | perl -nle 'next if /^\cZ/; print' | wc -l
3450

Now when I try to process both files at once, I lose one record. I am guessing this has something to do with the ^Z character, but I cannot figure out what to do about it:

unzip -j -p -qq '*.zip' | perl -nle 'print' | wc -l
6901  ## this should have been 6902 

unzip -j -p -qq '*.zip' | perl -nle 'next if /^\cZ/; print' | wc -l
6899  ## this should have been 6900 

These files are huge (20+ GB each) and are to be read in groups of 3-6 files, so I wanted to avoid processing them one by one and concatenating the results later. Any thoughts on how to avoid the ^Z character without running into the above issue?

I am on a Linux machine. Btw, opening the file in vim does not display the last record (i.e., the ^Z), and running set ff=unix did not change this either. So vim reports 3450 lines for the single unzipped file and 6900 for the combined unzipped files.

Thanks!

Upvotes: 2

Views: 366

Answers (1)

ikegami

Reputation: 386406

Since the ^Z isn't followed by a line ending, unzip is producing

 file1:1
 file1:2
 file1:3
 ^Zfile2:1
 file2:2
 file2:3
 ^Z

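so the ^Z that ends the first file sits at the start of the second file's first line. That is why the unfiltered count is 6901 rather than 6902, and why your filter deletes the first line of the second file. Instead of skipping the entire line, you could simply remove the ^Z: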

perl -pe 's/^\cZ//'

That said, unzip -a is designed for exactly this situation. Not only will it strip the ^Z for you, it will also fix the line endings if necessary.

$ unzip -j -p -qq z.zip a.txt | od -c
0000000   a   b   c  \r  \n   d   e   f  \r  \n 032
0000013

$ unzip -j -p -qq z.zip b.txt | od -c
0000000   g   h   i  \r  \n   j   k   l  \r  \n 032
0000013

$ unzip -j -p -qq z.zip | od -c
0000000   a   b   c  \r  \n   d   e   f  \r  \n 032   g   h   i  \r  \n
0000020   j   k   l  \r  \n 032
0000026

$ unzip -j -p -qq -a z.zip | od -c
0000000   a   b   c  \n   d   e   f  \n   g   h   i  \n   j   k   l  \n
0000020
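If unzip -a is not an option for some reason, a cruder alternative (a different technique, not what unzip does internally) is to delete the offending bytes with tr. The sketch below replays the same 22 bytes shown in the combined od dump above; the helper names raw and clean are made up. It assumes the data never contains a legitimate carriage return or ^Z, since tr removes them everywhere in the stream and not just at line ends or end of file.

```shell
# Same byte stream as the combined "unzip -j -p -qq z.zip" output above:
# CRLF line endings, and a ^Z (octal 032) after each file's last line.
raw() { printf 'abc\r\ndef\r\n\032ghi\r\njkl\r\n\032'; }

# tr -d deletes every CR and ^Z byte, wherever it appears in the stream.
clean() { raw | tr -d '\r\032'; }

raw_bytes=$(raw | wc -c | tr -d ' ')
clean_bytes=$(clean | wc -c | tr -d ' ')
clean_lines=$(clean | wc -l | tr -d ' ')

echo "raw: $raw_bytes bytes, cleaned: $clean_bytes bytes in $clean_lines lines"
# prints: raw: 22 bytes, cleaned: 16 bytes in 4 lines
```

The byte counts line up with the od offsets above (octal 26 is 22, octal 20 is 16), so for this sample the result is the same as the -a output.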

Upvotes: 4
