Reputation: 2652
I'm using Perl (5.8.8, don't ask) and I'm looking at a serialised binary file that I want to parse and snaffle information from.
The format is as follows:
My current code somewhat naively skips the first 8 bytes, then reads byte by byte until it hits a null, and then does very specific parsing.
sub readGroupsFile {
    my %index;

    open (my $fh, "<:raw", "groupsfile");
    seek($fh, 8, 0);

    while (read($fh, my $userID, 7)) {
        $index{$userID} = ();
        seek($fh, 18, 1);

        my $groups = "";
        while (read($fh, my $byte, 1)) {
            last if (ord($byte) == 0);
            $groups .= $byte;
        }

        my @grouplist = split("\n", $groups);
        $index{$userID} = \@grouplist;
    }

    close($fh);
    return \%index;
}
Good news? It works.
However, I think it's not very elegant, and wonder if I can use the 2-byte number that specifies the number of items to follow to my advantage to speed up the parsing. I have no idea why else it would be there.
I think unpack() and its templates may provide an answer, but I can't figure out how it can work with variable-length arrays of strings that each have their own variable lengths.
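For illustration, this is roughly the shape of thing I'm imagining, with the count read via unpack. The "v" format (an unsigned little-endian 16-bit value) and the idea of one newline-terminated item per count are only guesses on my part:

use strict;
use warnings;

# Sketch only: "v" (unsigned little-endian 16-bit) is a guess at how the
# 2-byte count is stored, and one newline-terminated item per count is
# assumed -- exactly the details I can't pin down.
sub read_counted_items {
    my ($fh) = @_;

    read($fh, my $raw_count, 2) == 2 or die "short read on item count";
    my ($count) = unpack 'v', $raw_count;

    my @items;
    for (1 .. $count) {
        defined(my $item = readline $fh) or die "ran out of data";
        chomp $item;
        push @items, $item;
    }
    return \@items;
}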
Upvotes: 3
Views: 470
Reputation: 66873
Here are two ways to reduce the number of hard-coded specifics, based on the data description; one reads by those null bytes (then changes back to newlines), the other unpacks lines with nuls.
Set the $/ variable to the null byte, and read the first 4 (four) such "lines." You get your user ID there, and the last such "line" read is the number of items that follows. Restore $/ to newline and read that list, using normal readline (aka <>). Repeat, if this pattern indeed repeats.
use warnings;
use strict;
use feature 'say';

my $file = shift or die "Usage: $0 file\n";  # a_file_with_nuls.txt

open my $fh, '<', $file or die "Can't open $file: $!";

my ($user_id, $num_items);

while (not eof $fh) {
    READ_BY_NUL: {
        my $num_of_nul_lines = 4;
        local $/ = "\x00";
        my $line;
        for my $i (1..$num_of_nul_lines) {
            $line = readline $fh;
            chop $line;
            if ($i == 2) {
                $user_id = $line;
            }
        }
        $num_items = $line;  # last nul-terminated "line"
    }

    say "Got: user-id = |$user_id|, and number-of-items = |$num_items|";

    my @items;
    for (1..$num_items) {
        my $line = readline $fh;
        chomp $line;
        push @items, $line;
    }
    say for @items;
}
Since $/ is set using local in the READ_BY_NUL block, its previous value is restored once the block is exited.
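A minimal, standalone illustration of that scoping (not part of the program above):

use strict;
use warnings;
use feature 'say';

say "before: ", unpack("H*", $/);      # 0a -- the default newline
{
    local $/ = "\x00";                 # in effect only inside this block
    say "inside: ", unpack("H*", $/);  # 00
}
say "after:  ", unpack("H*", $/);      # 0a -- restored automatically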
The output is as expected, but please add checks. Also, one can imagine errors from which it would make sense to recover (for example: the actual number of items falling short of the given number).
The whole thing is in a while loop with a manual check (and termination) using eof, on the assumption that the pattern of four nuls + number-of-lines indeed repeats (a little unclear from the question).
I test with a file made by

perl -wE'say "toss\x00user-id\x00this-too\x003\x00item-1\nitem2\nitem 3"' > a_file_with_nuls.txt

which is then appended to itself multiple times, to give us something for that while loop.
Finally, make that read <:raw on systems which need it, and unpack as needed. See below.
As stated in the question, (some?) data is binary, so what is read above needs to be unpack-ed. That also means that there may be problems with reading up to null bytes -- how was that data written in the first place? It is possible for unfilled parts of those fixed-width fields to be filled exactly with nuls.
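For example, a small sketch of how the two string formats treat such a nul-padded field (the 8-byte width and contents are made up):

use strict;
use warnings;
use feature 'say';

my $field = "abc\0de\0\0";    # hypothetical 8-byte fixed-width field

say unpack("A8", $field);     # "abc\0de" -- A strips only trailing nuls/spaces
say unpack("Z8", $field);     # "abc"     -- Z stops at the first nul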
Another option altogether is to simply read lines, and unpack the first one (and then unpack one line every time after the given number of lines, specified as "items," has been read).
open my $fh, '<:raw', $file or die "Can't open $file: $!";

my @items;
my $block_lines = 1;

while (my $line = <$fh>) {
    chomp $line;

    if ( $. % $block_lines == 0 ) {
        my ($uid, $num_items) = unpack "x8 A7x x13 i3x", $line;
        say "User-id: $uid, read $num_items lines for items";
        $block_lines += 1 + $num_items;
    }
    else {
        push @items, $line;
    }
}
say for @items;
Here the number of bytes to skip (x8 and x13) includes the zero.
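A tiny illustration of how such a template walks over a field and its terminating zero (record contents made up):

use strict;
use warnings;
use feature 'say';

my $rec = "HDRHDR\0\0" . "user-17" . "\0" . "rest of the line";

my ($uid, $rest) = unpack "x8 A7x A*", $rec;
say $uid;    # "user-17" -- the x after A7 consumes the field's trailing nul
say $rest;   # "rest of the line"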
This assumes that the number of "items" (lines) to read in every "block" may differ, and adds them up as it goes (plus the line with nuls, for the running total $block_lines), so as to be able to check when it is again at a line with nuls ($. % $block_lines == 0).
It makes a few other (reasonable) assumptions for things which aren't specified. This has been checked only lightly, with some made up data.
Upvotes: 3
Reputation: 385546
You have no idea how much to read, so reading in the whole file at once will get you the best speed results.
{
    my $file = do { local $/; <> };

    $file =~ s/^.{8}//s
        or die("Bad data");

    while (length($file)) {
        $file =~ s/^([^\0]*)\0[^\0]*\0[^\0]*\0([^\0]*)\0//
            or die("Bad data");

        my $user_id = $1;
        my @items = split(/\n/, $2, -1);

        ...
    }
}
By using a buffer, you can get most of the benefits of reading in the whole file at once without actually reading the whole file in at once, but it will make the code more complicated.
{
    my $buf = '';

    my $not_eof = 1;
    my $reader = sub {
        $not_eof &&= read(\*ARGV, $buf, 1024*1024, length($buf));
        die($!) if !defined($not_eof);
        return $not_eof;
    };

    while ($buf !~ s/^.{8}//s) {
        $reader->()
            or die("Bad data");
    }

    while (length($buf) || $reader->()) {
        my $user_id;
        my @items;
        while (1) {
            if ($buf =~ s/^([^\0]*)\0[^\0]*\0[^\0]*\0([^\0]*)\0//) {
                $user_id = $1;
                @items = split(/\n/, $2, -1);
                last;
            }

            $reader->()
                or die("Bad data");
        }

        ...
    }
}
Upvotes: 2