Reputation: 13412
What would be the most efficient way to read a given number of bytes from the beginning and the end of a huge file (binary or text)?
Example:
=head2 read_file_contents(file, limit)
Given a filename, returns its partial content (up to limit bytes), with the number of truncated bytes
=cut
sub read_file_contents
{
my ($file, $limit) = @_;
my $rv;
# Starting and ending number of bytes to read
$limit = int($limit / 2);
# Reading beginning of file
my $start;
# code goes here
# Reading end of a file
my $end;
# code goes here
$rv = $start . "\n\n\n truncated N bytes of data \n\n\n" . $end;
return $rv;
}
The main goal is to fetch the start and end bytes quickly, without processing the whole file. Reading the whole file and then using substr to cut out the needed parts is easy, but it is not going to work well with files of 10 GB and more.
Any solutions would be appreciated.
Upvotes: 3
Views: 434
Reputation: 13412
Thanks @DaveMitchell for the insight. Thanks to @ikegami for useful tips. This is what I eventually came up with.
It can be useful for tailing logs (returning reversed output) or previewing files of any size efficiently.
Example:
use Fcntl qw(SEEK_END);
=head2 read_file_contents_limit(file, limit, [opts])
Given a filename, returns its partial content with limit in bytes,
by default collected from both beginning and end of the file
* Options is a hash reference with
- [head] : Head the file only and just return beginning bytes
- [tail] : Tail the file only and return ending bytes
- [reverse] : Reverse output
- [nomessage] : Remove truncated message
=cut
sub read_file_contents_limit
{
my ($file, $limit, $opts) = @_;
my $data;
my $reverse = sub {
return join("\n", reverse split("\n", $_[0]));
};
my $nonulls = sub {
$_[0] =~ s/[^[:print:]\n\r\t]/\ /g;
return $_[0];
};
# Is binary file
my $binary = -B $file;
# Open file
open(my $fh, "<", $file) || return undef;
binmode $fh if ($binary);
# Get file size
my $fsize = -s $file;
# Return full file if requested limit fits the size
if ($fsize <= $limit) {
my $full;
read($fh, $full, $fsize);
$full = &$nonulls($full)
if ($binary);
$full = &$reverse($full)
if ($opts->{'reverse'});
return $full;
}
# Starting and ending number of bytes to read
my $split = !$opts->{'head'} && !$opts->{'tail'};
$limit = int($limit / 2) if ($split);
# Create truncated message
my $truncated = $fsize - $limit;
$truncated -= $limit if ($split);
$truncated = "\n\n\n[--- truncated ${truncated} bytes of data ---]\n\n\n";
$truncated = '' if ($opts->{'nomessage'});
# Reading beginning of file
my $head;
read($fh, $head, $limit);
# Return beginning only if requested
if ($opts->{'head'}) {
$head = &$nonulls($head)
if ($binary);
$head = &$reverse($head)
if ($opts->{'reverse'});
return $head . $truncated;
}
# Reading end of file
my $tail;
seek($fh, -$limit, SEEK_END);
read($fh, $tail, $limit);
# Return ending only if requested
if ($opts->{'tail'}) {
$tail = &$nonulls($tail)
if ($binary);
$tail = &$reverse($tail)
if ($opts->{'reverse'});
return $truncated . $tail;
}
# Return combined data
$data = $head . $truncated . $tail;
# Remove nulls for binary
$data = &$nonulls($data)
if ($binary);
# Reverse output if needed
$data = &$reverse($data)
if ($opts->{'reverse'});
return $data;
}
Here is an example of how it can be used to tail a log file and show the latest log lines at the top.
Usage:
say read_file_contents_limit('/var/webmin/miniserv.log', 2000, {'tail', 1, 'reverse', 1});
Output:
[--- truncated 1092091 bytes of data ---]
10.211.55.2 - root [14/Dec/2020:16:47:37 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:37 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:36 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:36 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:35 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:35 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:34 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:34 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:30 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
10.211.55.2 - root [14/Dec/2020:16:47:30 +0000] "GET /debug.cgi HTTP/1.1" 200 3662
10.211.55.2 - root [14/Dec/2020:16:47:24 +0000] "GET /favicon.ico HTTP/1.1" 200 15086
Upvotes: 3
Reputation: 2403
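use Fcntl qw(SEEK_END);
open(my $fh, "<", $file) or die "...";
my ($start, $end);
# Read the first $limit bytes
my $r = read($fh, $start, $limit) or die "...";
die "short read\n" unless $r == $limit;
# Jump to $limit bytes before the end and read the rest
seek($fh, -$limit, SEEK_END) or die "...";
$r = read($fh, $end, $limit) or die "...";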
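The reads above die on a short read, and a negative seek fails when the file is shorter than the limit. One possible way to wrap them into a routine that also copes with small files is sketched below. The name head_and_tail and its return convention are assumptions made for illustration, not part of the answer above.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Hypothetical helper: returns (head, tail, skipped), where skipped is the
# number of bytes between the two chunks that were never read.
sub head_and_tail {
    my ($file, $limit) = @_;
    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";

    # stat() on the handle gives the size without reading any data.
    my $size = (stat($fh))[7];

    # Small file: the two chunks would overlap, so read everything once.
    if ($size <= 2 * $limit) {
        my $all = '';
        my $r = read($fh, $all, $size);
        defined $r or die "read failed: $!";
        close($fh);
        return ($all, '', 0);
    }

    my ($head, $tail) = ('', '');
    my $r = read($fh, $head, $limit);
    defined $r or die "read failed: $!";

    # Position $limit bytes before the end of the file.
    seek($fh, -$limit, SEEK_END) or die "seek failed: $!";
    $r = read($fh, $tail, $limit);
    defined $r or die "read failed: $!";

    close($fh);
    return ($head, $tail, $size - 2 * $limit);
}
```

Only two chunks of $limit bytes are read for a large file, so the cost should not depend on the file size.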
Upvotes: 3
Reputation: 47
Check file size, then seek near the end...
https://perldoc.perl.org/functions/seek
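The idea above could be sketched roughly as follows, assuming a regular file whose size can be queried. The name read_tail is hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: returns at most $n bytes from the end of $file.
sub read_tail {
    my ($file, $n) = @_;
    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";

    # Check the file size first; stat() does not read the file contents.
    my $size = (stat($fh))[7];

    # Clamp the offset so files shorter than $n are read from the start.
    my $offset = $size > $n ? $size - $n : 0;
    seek($fh, $offset, SEEK_SET) or die "seek failed: $!";

    my $buf = '';
    my $r = read($fh, $buf, $n);
    defined $r or die "read failed: $!";
    close($fh);
    return $buf;
}
```

For example, print read_tail('/var/webmin/miniserv.log', 2000); would print up to the last 2000 bytes of that log.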
Upvotes: 0