Jean-Denis Muys

Reputation: 6842

How can I parse a string into a hash using keywords in Perl?

I have a string in which predefined keywords introduce different pieces of data. Is there a way to parse it with a clever regexp, or something similar? Here is an example:

Keywords can be "first name: " and "last name: ". Now I want to parse:

"character first name: Han last name: Solo"

into

{ "first name: " => "Han ", "last name: " => "Solo" }

Of course, the order of the keywords in the input string is not fixed. This should also work on:

"character last name: Solo first name: Han"

I understand there are issues to be raised with spaces and so on. I'll ignore them here.

I know how to solve this problem by looping over the different keywords, but I don't find that very pretty.

Split almost fits the bill. Its only problem is that it returns an array and not a hash, so I don't know which is the first name or the last name.

My example is somewhat misleading. Here is another one:

my @keywords = ("marker 1", "marker 2", "marker 3");
my $rawString = "beginning marker 1 one un marker 2 two deux marker 3 three trois and the rest";
my %result;
# <grind result>
print Dumper(\%result);

will print:

$VAR1 = {
      'marker 2' => ' two deux ',
      'marker 3' => ' three trois and the rest',
      'marker 1' => ' one un '
    };

Upvotes: 2

Views: 795

Answers (6)

Eric Strom

Reputation: 40142

Here is a solution using split (with separator retention mode) that is extensible with other keys:

use warnings;
use strict;

my $str = "character first name: Han last name: Solo";

my @keys = ('first name:', 'last name:');

my $regex = join '|' => @keys;

my ($prefix, %hash) = split /($regex)\s*/ => $str;

print "$_ $hash{$_}\n" for keys %hash;

which prints:

last name: Solo
first name: Han 

To handle keys that contain regex metacharacters, replace the my $regex = ... line with:

 my $regex = join '|' => map {quotemeta} @keys;
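As a rough sketch, here is the same split approach applied to the second example from the question, with the quotemeta variant included. The variable names come from the question; the sorted print at the end is only there to make the output order predictable.

```perl
use strict;
use warnings;

# Keywords and input taken from the question's second example.
my @keywords  = ('marker 1', 'marker 2', 'marker 3');
my $rawString = 'beginning marker 1 one un marker 2 two deux marker 3 three trois and the rest';

# Escape each keyword so it is matched literally, then build the alternation.
my $regex = join '|', map { quotemeta } @keywords;

# The capturing group makes split return the separators as well, so after the
# leading text the list alternates keyword, value, keyword, value, ...
my ($prefix, %result) = split /($regex)/, $rawString;

print "'$_' => '$result{$_}'\n" for sort keys %result;
```

Because there is no \s* after the capture group here, the values keep their surrounding spaces, as in the output the question asks for. One caveat: if the string ends with a keyword, split drops the trailing empty field and the hash assignment gets an odd number of elements.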

Upvotes: 7

Sinan Ünür

Reputation: 118128

The following loops over the string once to find matches (after normalizing the string). The only way you can avoid the loop is if each keyword can only appear once in the text. If that were the case, you could write

my %matches = $string =~ /($re):\s+(\S+)/g;

and be done with it.
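A minimal, self-contained sketch of that one-liner, assuming a hand-written pattern in place of the one Regex::PreSuf would build (the input string is one of the question's examples):

```perl
use strict;
use warnings;

my $string = 'character last name: Solo first name: Han';

# Hand-written alternation standing in for presuf('first name', 'last name').
my $re = qr/(?:first|last) name/;

# In list context, a /g match with two capture groups returns a flat
# (key, value, key, value, ...) list, which initialises the hash directly.
my %matches = $string =~ /($re):\s+(\S+)/g;

print "$_ => $matches{$_}\n" for sort keys %matches;
```

If a keyword does occur more than once, later values silently overwrite earlier ones in %matches.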

The script below deals with possible multiple occurrences.

#!/usr/bin/perl

use strict; use warnings;

use File::Slurp;
use Regex::PreSuf;

my $re = presuf( 'first name', 'last name' );

my $string = read_file \*DATA;
$string =~ s/\n+/ /g;

my %matches;

while ( $string =~ /($re):\s+(\S+)/g ) {
    push @{ $matches{ $1 } }, $2;
}

use Data::Dumper;
print Dumper \%matches;

__DATA__
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore character first name: Han last
name: Solo et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud character last name: Solo first name: Han exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute
irure dolor in reprehenderit in voluptate velit esse cillum
character last name: Solo first name: Han dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum

Upvotes: 3

Zaid

Reputation: 37146

use strict;
use warnings;
use Data::Dump 'dump';   # dump allows you to see what %character 'looks' like

# Sample input from the question
my $string = "character first name: Han last name: Solo";

my %character;
my $nameTag = qr{(?:first|last) name:\s*};

# Use a hash slice to populate the hash in one go
@character{ ($1, $3) } = ($2, $4) if $string =~ /($nameTag)(.+)($nameTag)(.+)/;

dump %character; # prints ("last name: ", "Solo", "first name: ", "Han "); key order may vary

Upvotes: 2

daxim

Reputation: 39158

This works.

use 5.010;
use Regexp::Grammars;
my $parser = qr{
        (?:
            <[Name]>{2}
        )
        <rule: Name>
            ((?:fir|la)st name: \w+)
}x;

while (<DATA>) {
    /$parser/;
    use Data::Dumper; say Dumper $/{Name};
}

__DATA__
character first name: Han last name: Solo
character last name: Solo first name: Han

Output:

$VAR1 = [
          ' first name: Han',
          ' last name: Solo'
        ];

$VAR1 = [
          ' last name: Solo',
          ' first name: Han'
        ];

Upvotes: 2

Jim Garrison

Reputation: 86774

This is possible IF:

1) You can identify a small set of regexes that can pick out the tags
2) The regex for extracting the value can be written so that it picks out only the value and ignores following extraneous data, if any, between the end of the value and the start of the next tag.

Here's a sample of how to do it with a very simple input string. This is a debug session:

  DB<14> $a = "a 13 b 55 c 45";
  DB<15> %$b = $a =~ /([abc])\s+(\d+)/g;
  DB<16> x $b
0  HASH(0x1080b5f0)
   'a' => 13
   'b' => 55
   'c' => 45
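The same idea as a standalone script, sketched against the question's input. This version makes one assumption that is not in the session above: a value is taken to be everything up to the next tag (or the end of the string), which is enforced with a lookahead. The names @tags, $input and %fields are illustrative.

```perl
use strict;
use warnings;

my @tags  = ('first name:', 'last name:');
my $input = 'character first name: Han last name: Solo';

# Escape the tags so they match literally, then join them into an alternation.
my $tag = join '|', map { quotemeta } @tags;

# (.*?) is non-greedy and the lookahead stops it at the next tag or at the end
# of the string, so each value runs up to, but not including, the next tag.
my %fields = $input =~ /($tag)\s*(.*?)\s*(?=$tag|\z)/gs;

print "$_ $fields{$_}\n" for sort keys %fields;
```

Note that the \s* on either side of the value trims surrounding whitespace; drop them if the spaces matter.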

Upvotes: 0

Colin Fine

Reputation: 3364

Use Text::ParseWords. It probably doesn't do all of what you want, but you're much better off building on it than trying to solve the whole problem from scratch.

Upvotes: -1
