masterial

Reputation: 2216

How to extract into hash

Hey, I am not sure why my code does not work. I am trying to extract some information from an HTML file that contains lines like these:

Junk id="i_0100_1" alt="text1, text2 | text3"
Junk Junk id="i_0100_2" alt="text1, text2 | text3"

I am using this to do it.

my $file = "page.html";

open (LOGFILE, $file);
my %hash;
while (my $line = <LOGFILE>)     
{ 
    %hash = $line =~ /^\s*id="([^"]*)"\s*alt="([^"]*)"/mg;
    print $hash{'id'};
}   
close LOGFILE;

What am I missing?

Upvotes: 0

Views: 324

Answers (5)

masterial

Reputation: 2216

This did the trick:

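my $file = "page.htm";

# Three-argument open with a lexical filehandle; stop if the file cannot be read.
open(my $log, '<', $file) or die "Cannot open $file: $!";
my %hash;
while (my $line = <$log>)
{
    # No ^ anchor: id="..." can appear anywhere in the line.
    # On a match, %hash holds a single pair: id value => alt text.
    %hash = $line =~ /\s*id="([^"]*)"\s*alt="([^"]*)"/;
    for my $key ( keys %hash ) {
        my $value = $hash{$key};
        print "$key\n$value\n";
    }
}
close $log;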

The problem was with the hash output and the regex definition. Thanks to Eugene, Michael and Ish. :)

Upvotes: 0

Michael Carman

Reputation: 30851

In addition to Axeman's suggestions (the most important of which is to not parse HTML yourself):

  1. The ^ anchor will prevent your regex from matching since "id" isn't at the beginning of the line.
  2. You're resetting the data in %hash with each assignment, which probably isn't what you want.
  3. You're printing the value for key "id" but you don't store that in the hash. What you store (or would, if the pattern ever matched) is the value of the id attribute.
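
The three points above can be sketched as follows. This is only an illustration, assuming the sample lines from the question are held in an in-memory list; the names @lines and %alt_for are made up for the example and do not come from the original code.

```perl
use strict;
use warnings;

# Sample lines copied from the question; a real script would read them from the file.
my @lines = (
    'Junk id="i_0100_1" alt="text1, text2 | text3"',
    'Junk Junk id="i_0100_2" alt="text1, text2 | text3"',
);

my %alt_for;    # hypothetical name: maps each id value to its alt text
for my $line (@lines) {
    # No ^ anchor, so the match may start anywhere in the line.
    if ( $line =~ /id="([^"]*)"\s*alt="([^"]*)"/ ) {
        # Add one entry per line instead of reassigning the whole hash.
        $alt_for{$1} = $2;
    }
}

# The keys are the id values themselves, not the literal string "id".
for my $id ( sort keys %alt_for ) {
    print "$id => $alt_for{$id}\n";
}
```

Keying the hash by the id value keeps one entry per matching line, so nothing is lost between iterations.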

Upvotes: 2

Axeman

Reputation: 29854

  1. Per other suggestion: You might not be opening the file. Check the return or use autodie.
  2. The scanned HTML may not be in lower case. Use the i regex flag.
  3. Per the rules of HTML, not all attribute values need to be quoted.
  4. Also per the rules of HTML, the '=' does not have to come right after the attribute name or right before the value.
  5. They might not always occur in the same order or adjacent to each other.
  6. You're using regexes to parse HTML!

#6 is a summary of the problems in 3-5. The solution I suggest is to use HTML::Parser or HTML::TreeBuilder.
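
A rough sketch of the HTML::Parser route, under some loud assumptions: the question does not show which tag carries the attributes, so the input below wraps them in <img> tags, and the second tag is invented to exercise points 2-5 (upper case, unquoted value, spaces around '=', different order). %alt_for is a hypothetical name. HTML::Parser comes from CPAN and is not part of core Perl.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, must be installed separately

# Assumed input: the question's attributes wrapped in <img> tags.
my $html = <<'HTML';
<p>Junk <img id="i_0100_1" alt="text1, text2 | text3"></p>
<p>Junk <IMG ALT = 'other text' ID = i_0100_2></p>
HTML

my %alt_for;    # hypothetical name: id value => alt text

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag; attribute names arrive lowercased by default.
    start_h => [
        sub {
            my ($attr) = @_;
            $alt_for{ $attr->{id} } = $attr->{alt}
                if defined $attr->{id} && defined $attr->{alt};
        },
        'attr',
    ],
);

$parser->parse($html);
$parser->eof;

print "$_ => $alt_for{$_}\n" for sort keys %alt_for;
```

The handler only sees parsed attributes, so quoting style, letter case and attribute order stop being the script's problem.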

Upvotes: 4

Ish

Reputation: 29606

You do not need ^\s* at the beginning.

Try this: id\=\"(.*)\"\salt=\"(.*)\"

Demo http://rubular.com/r/ySG0XO5jbJ

EDIT

Try removing the /mg modifiers.

Upvotes: 1

Eugene Yarmash

Reputation: 150188

You should always check the return value from opening a file:

open LOGFILE, $file or die $!;

Also, the ^ anchor is probably unnecessary in the regex.
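
For illustration, a small sketch of what a checked open reports when the file is missing. The file name is hypothetical and chosen so that the open fails; the eval is only there so the example prints the message instead of exiting.

```perl
use strict;
use warnings;

my $file = 'no_such_page.html';    # hypothetical name, assumed not to exist

# Three-argument open with a lexical filehandle; eval traps the die so the
# sketch can show the message instead of terminating.
my $ok = eval {
    open( my $fh, '<', $file ) or die "Cannot open $file: $!\n";
    close $fh;
    1;
};
print "open failed: $@" unless $ok;
```

Without the check, a failed open leaves the loop reading from an unopened handle, so the script prints nothing and gives no hint why. Putting use autodie; at the top of the script is another way to get the same failure report without writing the or die by hand.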

Upvotes: 2
