Ilia Ross

Reputation: 13412

Most efficient way to read first and last number of bytes of a file in Perl

What would be the most efficient way to read a given number of bytes from the beginning and the end of a huge file (binary or text)?

Example:


=head2 read_file_contents(file, limit)

Given a filename, returns the bytes from its beginning and end, with the number of truncated bytes in between

=cut
sub read_file_contents
{
    my ($file, $limit) = @_;
    my $rv;

    # Starting and ending number of bytes to read
    $limit = $limit / 2;

    # Reading beginning of file
    my $start;

    # code goes here

    # Reading end of a file
    my $end;

    # code goes here

    $rv = $start . "\n\n\n truncated N bytes of data \n\n\n" . $end;

    return $rv;
}

The main goal is to fetch the start and end bytes quickly, without processing the whole file. Reading the whole file and then using substr on it is not a problem for small files, but it will not work well for files of 10 GB or more.

Any solutions would be appreciated.

Upvotes: 3

Views: 434

Answers (3)

Ilia Ross

Reputation: 13412

Thanks @DaveMitchell for the insight. Thanks to @ikegami for useful tips. This is what I eventually came up with.

It can be useful for tailing logs (returning reversed output) or previewing files of any size efficiently.

Example:

use Fcntl qw(SEEK_END);

=head2 read_file_contents_limit(file, limit, [opts])

Given a filename, returns its partial content with limit in bytes,
by default collected from both beginning and end of the file 
* Options is a hash reference with
  - [head]      : Head the file only and just return beginning bytes
  - [tail]      : Tail the file only and return ending bytes
  - [reverse]   : Reverse output
  - [nomessage] : Remove truncated message

=cut
sub read_file_contents_limit
{
    my ($file, $limit, $opts) = @_;
    my $data;
    my $reverse = sub {
        return join("\n", reverse split("\n", $_[0]));
    };
    my $nonulls = sub {
        $_[0] =~ s/[^[:print:]\n\r\t]/\ /g;
        return $_[0];
    };

    # Is binary file
    my $binary = -B $file;

    # Open file
    open(my $fh, "<", $file) || return undef;
    binmode $fh if ($binary);

    # Get file size
    my $fsize = -s $file;

    # Return full file if requested limit fits the size
    if ($fsize <= $limit) {
        my $full;
        read($fh, $full, $fsize);
        $full = &$nonulls($full)
          if ($binary);
        $full = &$reverse($full)
          if ($opts->{'reverse'});
        return $full;
    }

    # Starting and ending number of bytes to read
    my $split = !$opts->{'head'} && !$opts->{'tail'};
    $limit = int($limit / 2) if ($split);    # keep byte counts whole for odd limits

    # Create truncated message
    my $truncated = $fsize - $limit;
    $truncated -= $limit if ($split);
    $truncated = "\n\n\n[--- truncated ${truncated} bytes of data ---]\n\n\n";
    $truncated = '' if ($opts->{'nomessage'});    # empty string avoids undef warnings on concatenation

    # Reading beginning of file
    my $head;
    read($fh, $head, $limit);

    # Return beginning only if requested
    if ($opts->{'head'}) {
        $head = &$nonulls($head)
          if ($binary);
        $head = &$reverse($head)
          if ($opts->{'reverse'});
        return $head . $truncated;
    }

    # Reading end of file
    my $tail;
    seek($fh, -$limit, SEEK_END);
    read($fh, $tail, $limit);

    # Return ending only if requested
    if ($opts->{'tail'}) {
        $tail = &$nonulls($tail)
          if ($binary);
        $tail = &$reverse($tail)
          if ($opts->{'reverse'});
        return $truncated . $tail;
    }

    # Return combined data
    $data = $head . $truncated . $tail;

    # Remove nulls for binary
    $data = &$nonulls($data)
      if ($binary);

    # Reverse output if needed
    $data = &$reverse($data)
      if ($opts->{'reverse'});
    return $data;
}

Here is an example of how it can be used to tail a log file and show the latest log lines at the top.

Usage:

use feature 'say';

say read_file_contents_limit('/var/webmin/miniserv.log', 2000, { tail => 1, reverse => 1 });

Output:

[--- truncated 1092091 bytes of data ---]


10.211.55.2 - root [14/Dec/2020:16:47:37 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:37 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:36 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:36 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:35 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:35 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:34 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:34 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:30 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:30 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:24 +0000] "GET /favicon.ico HTTP/1.1" 200 15086

Upvotes: 3

Dave Mitchell

Reputation: 2403

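use Fcntl qw(SEEK_END);

open(my $fh, "<", $file) or die "...";
# Read the first $limit bytes
my $r = read($fh, $start, $limit) or die "...";
die "short read\n" unless $r == $limit;
# Jump to $limit bytes before the end of the file and read from there
seek($fh, -$limit, SEEK_END) or die "...";
$r = read($fh, $end, $limit) or die "...";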

Upvotes: 3

leszekrabka

Reputation: 47

Check file size, then seek near the end...

https://perldoc.perl.org/functions/seek
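
A minimal sketch of that idea, assuming the file is a regular file that fits the seek/read model. The helper name head_and_tail and its return shape are made up for illustration and are not part of any module. It checks the size first, so small files are returned whole and the negative seek never reaches past the start of the file:

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Hypothetical helper: returns ($head, $tail, $skipped_bytes)
sub head_and_tail {
    my ($file, $limit) = @_;
    my $half = int($limit / 2);

    # :raw reads plain bytes, which suits both text and binary files
    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";
    my $size = -s $fh;

    # Small file: read everything, nothing is skipped
    if ($size <= $limit) {
        my $all = '';
        read($fh, $all, $size) // die "Cannot read $file: $!";
        close($fh);
        return ($all, '', 0);
    }

    my ($head, $tail) = ('', '');

    # First $half bytes
    read($fh, $head, $half) // die "Cannot read $file: $!";

    # Last $half bytes: seek relative to the end of the file
    seek($fh, -$half, SEEK_END) or die "Cannot seek in $file: $!";
    read($fh, $tail, $half) // die "Cannot read $file: $!";
    close($fh);

    return ($head, $tail, $size - 2 * $half);
}
```

Only $limit bytes are ever read, so the cost does not depend on the file size. Called as my ($head, $tail, $skipped) = head_and_tail($file, 2000); it leaves the formatting of the "truncated" message to the caller.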

Upvotes: 0
