Reputation: 403

Regex to match `rel` attribute of `img` element which only exists sometimes

I am having trouble with a Perl regex. On an img element, I want to match a src attribute whose value starts with /file?id, along with any class and alt attributes. I want to ignore the rel attribute, which is sometimes present and sometimes absent, as in the examples below:

<img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">    

<img  src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">

My question is how to handle the optional rel attribute.

I am trying this for the rel attribute match:

(?!\s+(rel)="([^"]+)")

It works when there is no rel attribute but fails when the img has a rel attribute.
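
The behaviour described above follows from what (?!...) means: a negative lookahead asserts that rel is absent, so it can never match a tag that has one. A minimal sketch of the alternative, an optional non-capturing group, is shown here. It assumes the attributes always appear in the order rel, src, class, alt and are double-quoted; the variable names @tags and $img_re are only illustrative.

```perl
use strict;
use warnings;

# Sample tags from the question: one with rel, one without.
my @tags = (
    '<img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">',
    '<img  src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">',
);

# (?: ... )? makes the rel attribute optional, whereas (?! ... )
# would require that it is not there at all.
my $img_re = qr{
    <img
    (?: \s+ rel="[^"]*" )?          # optional rel attribute
    \s+ src="(/file\?id=[^"]*)"     # capture the src value
    \s+ class="[^"]*"               # any class value
    \s+ alt="[^"]*"                 # any alt value
    \s* >
}x;

for my $tag (@tags) {
    print "src: $1\n" if $tag =~ $img_re;
}
```

This pattern is tied to one attribute order and to well-formed, double-quoted markup, so it is best read as an illustration of the optional group rather than a general solution.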

Upvotes: 2

Answers (3)

daxim

Reputation: 39158

Web::Query wins!

use Web::Query 'wq';
my $html = <<'';
<html>
<img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess1.jpg">
<img class="bbc_img" src="/file?id=13167" alt="myimagess2.jpg">
<img src="/file?id=13168" class="bbc_img" >
<img src="/file?id=13169" alt="myimagess3.jpg">
<img  src="/foo" class="bbc_img" alt="myimagess.jpg4">

print "$_\n" for wq($html)->find('img[src^="/file?id="][class][alt]')->attr('src');
__END__
/file?id=13166
/file?id=13167

Learn from this: XPath is more powerful than CSS selectors, but CSS selectors are shorter.

Upvotes: 2

Borodin

Reputation: 126742

This is trivial to do using a proper HTML parser. This program demonstrates using HTML::TreeBuilder and the look_down method.

It is searching for all elements with:

A tag name of 'img'
A src attribute that matches the regex qr|^/file\?id=|
A class attribute that matches the null regex (i.e. a class attribute with any value)
An alt attribute that matches the null regex

You don't say what you want to do with the elements once you've found them. This code just uses as_HTML to display them.

use strict;
use warnings;

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_file(\*DATA);
my @images = $html->look_down(
  _tag => 'img',
  src => qr|^/file\?id=|,
  class => qr//,
  alt => qr//
);
print $_->as_HTML, "\n" for @images;

__DATA__
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">    
    <img  src="/file?id=13166" class="bbc_img" alt="myimagess.jpg">
    <img  src="/file" class="bbc_img" alt="myimagess.jpg"> <!-- src lacks ?id= -->
    <img  src="/file?id=13166" alt="myimagess.jpg">        <!-- no class attribute -->
    <img  src="/file?id=13166" class="bbc_img">            <!-- no alt attribute -->
  </body>
</html>

output

<img alt="myimagess.jpg" class="bbc_img" rel="lightbox[45451]" src="/file?id=13166" />
<img alt="myimagess.jpg" class="bbc_img" src="/file?id=13166" />

Upvotes: 2

mirod

Reputation: 16171

A proper way to do this, using HTML::TreeBuilder::XPath. This ignores rel and any other attribute, and does not depend on the order of attributes in the tag.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;
use Test::More tests => 1;

my $root= HTML::TreeBuilder::XPath->new_from_content( do { local $/; <DATA> });

# this is the important part 
my @imgs= $root->findnodes( '//img[starts-with( @src,"/file?id=") and @class and @alt]');

# checks the results
my $hits= join ' ', map { "H:" . src_id( $_->{src}) } @imgs;
is( $hits, 'H:13166 H:13167', "one test");

# shows how to access the attributes
foreach my $img (@imgs)
  { warn "hit: src= $img->{src} - class=$img->{class} - alt: $img->{alt} - id= ", src_id( $img->{src}), "\n"; }

exit; 

sub src_id
  { my( $src)= @_;
    return $src=~  m{/file\?id=(.+)$} ? $1 : 'no id'; 
  }

__DATA__
<html>
  <head><title>Test HTML</title></head>
  <body>
    <img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess1.jpg">
    <img class="bbc_img" src="/file?id=13167" alt="myimagess2.jpg">
    <img src="/file?id=13168" class="bbc_img" >
    <img src="/file?id=13169" alt="myimagess3.jpg">
    <img  src="/foo" class="bbc_img" alt="myimagess.jpg4">
  </body>
</html>
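
For comparison, the same three conditions (src prefix, class present, alt present) can be approximated with a plain regex by using one positive lookahead per attribute, which removes the dependence on attribute order. This is only a sketch using core Perl: it assumes double-quoted attribute values that contain no > character, and the names @tags, $img_re and @hits are illustrative. A parser remains the more robust choice for real HTML.

```perl
use strict;
use warnings;

# The img tags from the test data above, one per string.
my @tags = (
    '<img rel="lightbox[45451]" src="/file?id=13166" class="bbc_img" alt="myimagess1.jpg">',
    '<img class="bbc_img" src="/file?id=13167" alt="myimagess2.jpg">',
    '<img src="/file?id=13168" class="bbc_img" >',
    '<img src="/file?id=13169" alt="myimagess3.jpg">',
    '<img  src="/foo" class="bbc_img" alt="myimagess.jpg4">',
);

# Each lookahead scans the tag from just after "<img" and checks one
# attribute, so the attributes may appear in any order.
my $img_re = qr{
    <img\b
    (?= [^>]* \s src="(/file\?id=[^"]*)" )   # src with required prefix, captured
    (?= [^>]* \s class="[^"]*" )             # class with any value
    (?= [^>]* \s alt="[^"]*" )               # alt with any value
    [^>]* >
}x;

# Collect the captured src value of every tag that satisfies all three checks.
my @hits = map { $_ =~ $img_re ? $1 : () } @tags;
print "$_\n" for @hits;
```

On this data the sketch selects the same two elements as the XPath expression, while rel and any other extra attribute are simply skipped over by the [^>]* parts.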

Upvotes: 1
