Jacob Wegelin

Reputation: 1306

perl regex negative-lookbehind detect file lacking final linefeed

The following code uses tail to test whether the last line of a file fails to culminate in a newline (linefeed, LF).

> printf 'aaa\nbbb\n' | test -n "$(tail -c1)" && echo pathological last line
> printf 'aaa\nbbb'   | test -n "$(tail -c1)" && echo pathological last line
pathological last line 
>
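
The tail test works because command substitution strips trailing newlines: if the last byte is LF, `$(tail -c 1)` expands to the empty string and `test -n` fails. A minimal sketch that wraps this in a reusable function; the name `lacks_final_lf` is hypothetical, not part of the original commands:

```shell
# Hypothetical helper: reads stdin and succeeds (exit 0) when the input
# does not end in LF.
lacks_final_lf() {
    # $(...) drops trailing newlines, so the expansion is empty when the
    # last byte is LF (or when the input is empty).
    test -n "$(tail -c 1)"
}

printf 'aaa\nbbb' | lacks_final_lf && echo pathological last line
```

Note that an empty input is treated as unproblematic under this test.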

One can test for the same condition by using perl, a positive lookbehind regex, and unless, as follows. This is based on the notion that, if a file ends with newline, the character immediately preceding end-of-file will be \n by definition.

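(Recall that -0 with no digits sets the input record separator to the NUL character, so the -n0 flag makes perl read a text file containing no NUL bytes as a single record; -0777 is the conventional way to "slurp" a file. Thus, as I understand it, there is only one $, the end of the file.)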

> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
> printf 'aaa\nbbb'   | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
pathological last line
>

Is there a way to accomplish this using if rather than unless, and negative lookbehind? The following fails, in that the regex seems to always match:

> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
> printf 'aaa\nbbb'   | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
>

Why does my regex always match, even when the end-of-file is preceded by newline? I am trying to test for an end-of-file that is not preceded by newline.

Upvotes: 0

Views: 143

Answers (3)

Jacob Wegelin

Reputation: 1306

The hidden context of my request was a perl script to "clean" a text file used in the TeX/LaTeX environment. This is why I wanted to slurp. (I mistakenly thought that the "laser focus" on a problem recommended by Stack Overflow meant editing out the context.)

Thanks to the responses, here is an improved draft of the script:

#!/usr/bin/perl
use strict; use warnings; use 5.18.2;
# Loop slurp: 
$/ = undef;     # input record separator: entire file is a single record.
# a "trivial line" looks blank, consists exclusively of whitespace, but is not necessarily a pure newline=linefeed=LF.
while (<>) {
    s/^\s*$/\n/mg;          # convert any trivial line to a pure LF. Unlike \z, $ works with /m multiline.
    s/[\n][\n]+/\n\n/g; # collapse any run of newlines to exactly 2, i.e., one blank line separates paragraphs. Like cat -s
    s/^[\n]+//;             # first line is visible or "nontrivial."
    s/[\n]+\z/\n/;      # last  line is visible or "nontrivial."
    print STDOUT;
    print "\n" unless m/\n\z/; # IF detect pathological last line, i.e., not ending in LF, THEN append LF. 
}

And here is how it works, when named zz.pl. First a messy file, then how it looks after zz.pl gets through with it:

bash: printf '  \n \r   \naaa\n \t \n  \n  \nbb\n\n\n\n    \t' 


aaa



bb



        bash: 
bash: 
bash: printf '  \n \r   \naaa\n \t \n  \n  \nbb\n\n\n\n    \t' | zz.pl
aaa

bb
bash: 

Upvotes: 0

Thomas Blankenhorn

Reputation: 256

Do you have a strong reason for using a regular expression for this job? Practicing regular expressions, for example? If not, I think a simpler approach is to just use a while loop that tests for eof and remembers the latest character read. Something like this might do the job.

 perl -le'while (!eof()) { $previous = getc(\*ARGV) } 
          if ($previous ne "\n") { print "pathological last line!" }'

PS: ikegami's comment about my solution being slow is well-taken. (Thanks for the helpful edit, too!) So I wondered if there's a way to read the file backwards. As it turns out, CPAN has a module for just that. After installing it, I came up with this:

perl -le 'use File::ReadBackwards; 
          my $bw = File::ReadBackwards->new(shift @ARGV);
          print "pathological last line" if substr($bw->readline, -1) ne "\n"'

That should work efficiently, even on very large files. And when I come back to read it a year later, I will more likely understand it than I would with the regular-expression approach.

Upvotes: 0

ikegami

Reputation: 386541

/(?<=\n)$/ is a weird and expensive way of doing /\n$/.

/\n$/ means /\n(?=\n?\z)/, so it's a weird and expensive way of doing /\n\z/.

A few approaches:

perl -n0777e'print "pathological last line\n" if !/\n\z/'

perl -n0777e'print "pathological last line\n" if /(?<!\n)\z/'

perl -n0777e'print "pathological last line\n" if substr($_, -1) ne "\n"'

perl -ne'$ll=$_; END { print "pathological last line\n" if $ll !~ /\n\z/ }'

The last solution avoids slurping the entire file.


"Why does my regex always match, even when the end-of-file is preceded by newline?"

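Because you mistakenly think that $ only matches at the end of the string. Without /m, $ also matches just before a newline that ends the string. In 'aaa\nbbb\n' the character before that position is b, not a newline, so (?<!\n)$ succeeds there. Use \z to match only at the very end of the string.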

Upvotes: 2
