Reputation: 75
I'm quite new to Perl and I'm having immense difficulty writing a Perl script that will successfully parse a structured text file.
I have a collection of files that look like this:
name:
	John Smith
occupation:
	Electrician
date of birth:
	2/6/1961
hobbies:
	Boating
	Camping
	Fishing
And so on. The field name is always followed by a colon, and all the data associated with those fields is always indented by a single tab (\t).
I would like to create a hash that will directly associate the field contents with the field name, like this:
$contents{name} = "John Smith";
$contents{hobbies} = "Boating, Camping, Fishing";
Or something along those lines.
So far I've been able to get all the field names into a hash by themselves, but I've had no luck wrangling the field data into a form that can be stored neatly in a hash. Substituting or splitting on newlines followed by tabs doesn't work (I've tried, somewhat naively). I've also tried a crude lookahead, where I create a duplicate array of the file's lines and use it to figure out where the field boundaries are, but it's not great in terms of memory consumption.
FWIW, currently I'm going through the file line by line, but I'm not entirely convinced that this is the best solution. Is there any way to do this parsing in a straightforward manner?
Upvotes: 3
Views: 3309
Reputation: 1814
This text file is actually quite close to YAML, and it's not difficult to convert it into a valid YAML file (the convert sub in the script below does this).
Once you have a YAML file, you can use YAML::Tiny or another module to parse it, which leads to cleaner code:
#!/usr/bin/perl
use strict;
use warnings;
use YAML::Tiny;
use Data::Dumper;
convert( './data.yaml', 'output.yaml' );
parse('output.yaml');
sub parse {
    my $yaml    = shift;
    my $yamlobj = YAML::Tiny->read($yaml);

    # Every field is a YAML list, so single-valued fields take element 0.
    my $name         = $yamlobj->[0]{name}[0];
    my $occ          = $yamlobj->[0]{occupation}[0];
    my $birth        = $yamlobj->[0]{'date of birth'}[0];
    my $hobbies      = $yamlobj->[0]{hobbies};
    my $hobbiestring = join ", ", @$hobbies;
    my $contents     = {
        name       => $name,
        occupation => $occ,
        birth      => $birth,
        hobbies    => $hobbiestring,
    };
    print "#RESULT:\n\n";
    print Dumper($contents);
}
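sub convert {
    my ( $input, $output ) = @_;
    open my $infh,  '<', $input  or die "Cannot read $input: $!";
    open my $outfh, '>', $output or die "Cannot write $output: $!";
    while ( my $line = <$infh> ) {
        # Turn each indented value into a YAML list item. Spaces replace the
        # tab because YAML does not allow tabs for indentation.
        $line =~ s/^[ \t]+/  - /;
        print $outfh $line;
    }
    close $outfh or die "Cannot close $output: $!";
}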
Upvotes: 2
Reputation: 5159
Reading the file line by line is a good way to go. Here I build a hash of array references, one array per field. This shows how to read a single file; you could read each file the same way and store the resulting hashes of arrays in a hash of hashes of arrays.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %contents;
my $key;
while (<DATA>) {
    chomp;
    if ( s/:\s*$// ) {    # a trailing colon marks a field name
        $key = $_;
    }
    else {
        s/^\s+//;         # remove leading whitespace
        push @{ $contents{$key} }, $_;
    }
}
print Dumper \%contents;
__DATA__
name:
	John Smith
occupation:
	Electrician
date of birth:
	2/6/1961
hobbies:
	Boating
	Camping
	Fishing
Output:
$VAR1 = {
          'occupation' => [
                            'Electrician'
                          ],
          'hobbies' => [
                         'Boating',
                         'Camping',
                         'Fishing'
                       ],
          'name' => [
                      'John Smith'
                    ],
          'date of birth' => [
                               '2/6/1961'
                             ]
        };
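If you want the comma-separated strings from the question rather than arrays, something along these lines might work. This is only a sketch: the sub names parse_record and flatten_record are made up for illustration, and it assumes field names are never indented while values always are, as described in the question. It reads from an in-memory filehandle so it runs without any data file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse one record from an open filehandle into a
# hash of array references. An unindented line ending in a colon starts a
# new field; any other non-blank line is a value of the current field.
sub parse_record {
    my ($fh) = @_;
    my ( %fields, $key );
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /\S/;           # skip blank lines
        if ( $line =~ /^(\S.*?):\s*$/ ) {    # unindented "field name:"
            $key = $1;
            $fields{$key} = [];
        }
        elsif ( defined $key ) {
            $line =~ s/^\s+|\s+$//g;         # trim surrounding whitespace
            push @{ $fields{$key} }, $line;
        }
    }
    return \%fields;
}

# Hypothetical helper: collapse each array into one comma-separated string.
sub flatten_record {
    my ($fields) = @_;
    my %flat;
    $flat{$_} = join ', ', @{ $fields->{$_} } for keys %$fields;
    return \%flat;
}

# Demo on an in-memory "file" (assumed sample data, tab-indented values).
my $sample = "name:\n\tJohn Smith\nhobbies:\n\tBoating\n\tCamping\n\tFishing\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $contents = flatten_record( parse_record($fh) );
close $fh;

print "$_ = $contents->{$_}\n" for sort keys %$contents;
# prints:
# hobbies = Boating, Camping, Fishing
# name = John Smith
```

For a whole collection of files, you could open each one in a loop and store the result of parse_record under the file name, which gives the hash of hashes of arrays mentioned above. Only one file's lines are held at a time, so no duplicate array of lines is needed.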
Upvotes: 6