zdd

Reputation: 8727

Extract innerHTML from multiple tags

I need to extract the inner text of an HTML link with Perl.

Here is an example:

<a href="www.stackoverflow.com">Regex Question</a>

I want to extract the string: Regex Question

Note that the inner text might be empty, as below. In that case I want an empty string.

<a href="www.stackoverflow.com"></a>

The inner text might also be wrapped in multiple tags, like this:

<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>

I have been trying to write a Perl regex for a while, without success. In particular, I don't know how to deal with multiple tags.

Upvotes: 1

Views: 352

Answers (5)

nu11p01n73R

Reputation: 26667

How about something like this?

(?<=>)[^<>\/]*(?=<\/)

It matches the string: Regex Question

example: http://regex101.com/r/sG4bZ1/1
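
To show how the pattern above might be used from Perl, here is a minimal sketch. The variable names are made up for illustration. Note that with /g the pattern also produces empty matches between adjacent closing tags such as </h2></b>, so this sketch filters them out, which means it cannot report a genuinely empty link.

```perl
use strict;
use warnings;

# Sample input taken from the question
my $html = '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>';

# In list context, /g returns every match of the whole pattern.
# grep { length } drops the zero-length matches found between closing tags.
my @texts = grep { length } $html =~ /(?<=>)[^<>\/]*(?=<\/)/g;

print "$_\n" for @texts;    # prints: Regex Question
```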

Upvotes: 0

Miller

Reputation: 35198

Use an HTML Parser for parsing HTML.

I suggest you take a look at Mojo::DOM, and Mojo::UserAgent if you need to download the content from the web.

The following will pull all the links with the href containing stackoverflow.com and display the text inside:

use strict;
use warnings;

use Mojo::DOM;
use Data::Dump;

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

for my $link ($dom->find('a[href*="stackoverflow.com"]')->each) {
    dd $link->all_text;
}

__DATA__
<html>
<body>
<a href="www.stackoverflow.com">Regex Question</a>
I want to extract the string: Regex Question

<a href="www.notme.com">Don't want this link</a>
Note that, the inner text might be empty like this. This example get an empty string.

<a href="www.stackoverflow.com"></a>
and the inner text might be enclosed with multiple tags like this.

<a href="www.stackoverflow.com"><b><h2>Regex Question with tags</h2></b></a>
</body>
</html>

Outputs:

"Regex Question"
""
"Regex Question with tags"

For a helpful 8-minute introductory video, check out Mojocast Episode 5.

Upvotes: 3

Chankey Pathak

Reputation: 21666

Parsing HTML with a regex is a bad idea; you're not Chuck Norris. You can use the Mojo::DOM module, which makes this task very easy.

A sample:

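use strict;
use warnings;
use feature 'say';

use Mojo::DOM;

# Parse
my $dom = Mojo::DOM->new('<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>');

# Find: all_text includes text nested inside child tags (text alone does not)
say $dom->at('a')->all_text;
say $dom->find('a')->map('all_text')->join("\n");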

To install Mojo::DOM (it ships with the Mojolicious distribution), run the command below:

$ cpan Mojo::DOM

Upvotes: 1

vks

Reputation: 67968

<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*<\/a>

Try this and see the demo. Grab capture group 1 to get the inner text.

http://regex101.com/r/sU3fA2/1
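
As a rough sketch of how this regex could be applied in Perl to the three cases from the question (the names $re, @samples and @inner are only illustrative), the inner text ends up in $1:

```perl
use strict;
use warnings;

# The pattern from this answer, compiled once
my $re = qr/<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*<\/a>/;

# The three sample links from the question
my @samples = (
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>',
);

# Capture group 1 holds the inner text; undef marks "no match"
my @inner = map { /$re/ ? $1 : undef } @samples;

printf "[%s]\n", $_ // 'no match' for @inner;
```

This assumes a single run of text inside the link; text split across sibling tags would not be captured as one string.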

Upvotes: 1

user557597

Reputation:

You should use an HTML parser, but it can probably be done with a regex.
This finds open-to-close A-tag pairs with no nested A-tags, and also
allows other tags in the content.
If you want the A-tag content without any other tags, the regex will be slightly different (not shown).

Since you are using Perl, this might work.

 # =~ /(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!\/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|<\/a\s*>)).)*)<\/a\s*>/

 (?s)
 <a                            # Begin A-tag, must (should) contain attrib/val's
 (?>
      \s+                      # (?!\s) add this if you think malformed '<a  >' could slip by
      (?: " .*? " | ' .*? ' | [^>]*? )+
      >
 )
 (?<! /> )                     # Lookbehind, ensure this is not a self-closed A-tag '<a/>'
 (                             # (1 start), Capture Content between open/close A-tags
      (?:                           # Cluster, match content
           (?!                           # Negative assertion
                (?>
                     <a                            # Not Start A-tag
                     (?>
                          \s+  
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                          >
                     )
                  |  </a \s* >                     #  and Not End A-tag
                )
           )
           .                             # Assert passed, consume a content character 
      )*                            # End Cluster, do 0 to many times
 )                             # (1 end)
 </a \s* >                     # End A-tag

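For illustration only, the compact one-line form of the regex above might be used like this. Group 1 still contains the inner tags, so this sketch adds a naive tag-stripping substitution that is not part of the answer. The variable names are hypothetical.

```perl
use strict;
use warnings;

# Sample input taken from the question
my $html = '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>';

# Group 1 captures everything between the opening and closing A-tags
my $content = '';
if ($html =~ /(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!\/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|<\/a\s*>)).)*)<\/a\s*>/) {
    $content = $1;    # still includes <b><h2>...</h2></b>
}

# Assumption: a simple substitution is enough to drop the remaining tags
(my $text = $content) =~ s/<[^>]*>//g;

print "$text\n";
```

The expanded, commented layout shown above would need the /x modifier if it were used directly in code.
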
Upvotes: 0
