Perl and Pattern Matching

I've been working on a script that takes an HTML file containing a bibliography and strips out everything except the authors. I'm having a hard time getting rid of some extraneous data, such as the characters in the HTML tags. I'd like to be able to strip away a whole tag or, even better, specific data between the tags.

Right now here is what my sub looks like:

    sub extractAuthorsIntoArray{
        @author_array = split /[<>"\/?!.=\(\)1234567890':]/, $doc;
        foreach(@author_array){
            print "$_" . "\n";
        }
    }

At the moment it strips away all the tag characters, but it leaves a lot of extraneous data that I don't want, such as the publication date, the publication name, and other data I don't need. Any time I try to get rid of, say, "< li >", it gives me my data with those characters missing altogether. Anyway, I'll keep hammering at it.

Laters.

EDIT:

What I'd like to do is take something like this:

< li value="2">Artem Chebotko and Shiyong Lu, "Nested Optional Join for Efficient Evaluation of SPARQL Nested Optional Graph Patterns". Progressive Concepts for Semantic Web Evolution: Applications and Developments, Miltiadis Lytras and Amit Sheth (Eds.), Information Science Publishing, ISBN 160566992X, 2010. < /li> < li>Artem Chebotko, Shiyong Lu, Farshad Fotouhi, and Anthony Aristar, "Ontology-Based Annotation of Multimedia Language Data for the Semantic Web". Semantic Web-Based Information Systems: State-of-the-Art Applications, Amit Sheth and Miltiadis Lytras (Eds.), IGI Global, ISBN 1599044269, 2006. < /li>

And end up with this:

Artem Chebotko and Shiyong Lu

Upvotes: 0

Answers (4)

Scavokovich

Reputation: 91

    #!/usr/bin/perl -w

    use strict;
    read DATA, my $string, -s DATA;
    my @matches = ( $string =~ /<\s+li\s*(?:.*?)>(.+?),\s+<\s+b>/g );
    print "$_\n\n" foreach (@matches);

    __DATA__
    < li value="2">Artem Chebotko and Shiyong Lu, < b>"Nested Optional Join for Efficient Evaluation of SPARQL Nested Optional Graph Patterns"< /b>. < i>Progressive Concepts for Semantic Web Evolution: Applications and Developments< /i>, Miltiadis Lytras and Amit Sheth (Eds.), Information Science Publishing, ISBN 160566992X, 2010.< br/>< br/>< /li> < li>Artem Chebotko, Shiyong Lu, Farshad Fotouhi, and Anthony Aristar, < b>"Ontology-Based Annotation of Multimedia Language Data for the Semantic Web"< /b>. < i>Semantic Web-Based Information Systems: State-of-the-Art Applications< /i>, Amit Sheth and Miltiadis Lytras (Eds.), IGI Global, ISBN 1599044269, 2006.< br/>< br/>< /li>

If you're willing to solve this specific problem, then what your regex should be looking for is either:

a) < li value="2">AUTHORS, < b>
b) < li>AUTHORS, < b>

For a) one possible regex is:

    < \s+ li \s+ value="2"> (.+), \s+ <\s+b>

For b) one possible regex is:

    < \s+ li> (.+), \s+ <\s+b>

Combining these two regexes yields:

    <\s+li\s*(?:.*?)>(.+?),\s+<\s+b>

It's not elegant, but maybe it'll help you.
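For readability, the combined pattern can also be written with the /x flag and a comment per piece. This is only a sketch of the same regex: the shortened sample string and the variable names are made up for illustration, and it assumes, as in the data above, that every `<` is followed by whitespace.

```perl
use strict;
use warnings;

# Shortened, made-up sample in the same shape as the data above.
my $sample = '< li value="2">A. Author and B. Author, < b>"Title One"< /b>. '
           . '< li>C. Author, D. Author, and E. Author, < b>"Title Two"< /b>.';

my @authors = $sample =~ m{
    <\s+li\s*      # opening "< li"; whitespace after "<" is required
    (?:.*?)>       # any attributes, up to the closing ">"
    (.+?)          # capture: the author list (non-greedy)
    ,\s+<\s+b>     # stop at the ", < b>" that starts the title
}xg;

print "$_\n" for @authors;
```

Because `.` does not match a newline without the /s flag, this form only works while each entry sits on a single line.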

Upvotes: 0

Dave Sherohman

Reputation: 46235

That's a rather... unusual... way to use split. It's normally used when your data consists of several items separated by delimiters: you split on the delimiters and get the individual items back. That isn't what you're trying to do here, so split is probably not the ~~droid~~ command you're looking for.
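A small sketch of the difference, using made-up sample strings: splitting on a real delimiter returns the fields, whereas a character class is a set of single characters, so every listed character becomes a delimiter wherever it occurs. That is why a tag such as `<li>` can't be removed as a unit this way.

```perl
use strict;
use warnings;

# Typical use: one delimiter separating several fields.
my @fields = split /,\s*/, 'Chebotko, Lu, Fotouhi';
print "$_\n" for @fields;

# A character class matches ONE character from the set, not a sequence:
# [<li>] splits on '<', 'l', 'i' or '>' individually.
my @pieces = split /[<li>]/, '<li>Shiyong Lu</li>';
print "[$_]\n" for @pieces;   # the 'i' inside "Shiyong" is consumed as well
```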

As already mentioned, a proper HTML parser would really be The Right Way to do this, but you specifically want to use a regex for educational purposes, so I'll give you one. Just be aware that parsing HTML with regexes is fraught with danger and there are almost certainly edge cases where this will fail.

So, that said:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use 5.010;

    my $text = q[< li value="2">Artem Chebotko and Shiyong Lu, < b>"Nested Optional Join for Efficient Evaluation of SPARQL Nested Optional Graph Patterns"< /b>. < i>Progressive Concepts for Semantic Web Evolution: Applications and Developments< /i>, Miltiadis Lytras and Amit Sheth (Eds.), Information Science Publishing, ISBN 160566992X, 2010.< br/>< br/>< /li> < li>Artem Chebotko, Shiyong Lu, Farshad Fotouhi, and Anthony Aristar, < b>"Ontology-Based Annotation of Multimedia Language Data for the Semantic Web"< /b>. < i>Semantic Web-Based Information Systems: State-of-the-Art Applications< /i>, Amit Sheth and Miltiadis Lytras (Eds.), IGI Global, ISBN 1599044269, 2006.< br/>< br/>< /li>];

    # Grab the contents of every <li>...</li> pair
    my @list_items = $text =~ m[<\s*li(?:\s+[^>]*)?>(.*?)<\s*/li\s*>]g;

    my @authors;
    for (@list_items) {
      # Authors are the text before the first ", <"; skip items that don't match
      push @authors, $1 if /([^<]+), </;
    }

    say for @authors;

Output:

    Artem Chebotko and Shiyong Lu
    Artem Chebotko, Shiyong Lu, Farshad Fotouhi, and Anthony Aristar

Upvotes: 1

rra

Reputation: 3887

The problem is hard to solve in general without some certainty about the structure of the data, but based on your example, I'll make the assumption that the authors are always the first non-tag content of your data and are terminated by a comma (which is a pretty common format).

That means the problem has two parts: strip any initial HTML tags, and then drop everything after the comma.

For the first, an HTML tag is fairly easy to recognize, since it starts with < and ends with > and normally can't contain either of those characters in between (a > inside a quoted attribute value is the rare exception). So:

    $line =~ s{ \A \s* (?: < [^>]+ > \s* )+ }{}xms;

will remove all HTML tags (and whitespace) at the start of a line. (This uses the /x flag and other coding style as recommended by Perl Best Practices.) Going through this step by step, \A matches the beginning of the string, \s* matches any amount of whitespace, and the core is < [^>]+ >, which matches the HTML tag by looking for the start of the tag and then taking one or more characters until the end of the tag. This is enclosed in (?: )+ to allow any number of them. (I'm using (?:) instead of just () since it's best practice to turn off capturing if you don't care about keeping that match.)

Removing everything from the comma afterwards is much easier:

    $line =~ s{ , .* }{}xms;
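Put together, the two substitutions might look like this on a single entry. The helper name `authors_from_entry` and the sample strings are hypothetical, made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions above.
sub authors_from_entry {
    my ($line) = @_;
    $line =~ s{ \A \s* (?: < [^>]+ > \s* )+ }{}xms;   # strip leading tags
    $line =~ s{ , .* }{}xms;                          # drop from the first comma on
    return $line;
}

print authors_from_entry('<li value="2">Artem Chebotko and Shiyong Lu, <b>"Some Title"</b>. 2010.</li>'), "\n";
print authors_from_entry('<li>Artem Chebotko, Shiyong Lu, and Anthony Aristar, <b>"Other Title"</b>.</li>'), "\n";
```

Note the consequence of the comma assumption: the second call returns only `Artem Chebotko`, because an author list that itself contains commas is cut at its first comma.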

Now, this assumes that each bibliography entry is a single scalar in your program. That glosses over a rather large problem; if instead you have a variable that contains the whole page, you may need to parse that. If each entry is an <li> tag, what you want to do is extract the contents of each <li> tag and then process it as above.

To do that, match in list context with the /g modifier, something like this:

    my @entries = ($doc =~ m{ <li (?: \s [^>]* )? > (.*?) </li> }xmsg);

Some more subtleties here. The (?: )? bit after <li optionally matches whitespace followed by some number of characters other than > to allow for any attributes to that tag. The (.*?) part does the actual work of extracting the content of the tag. Note the ? after the *. This makes the match non-greedy, which means that rather than matching everything up to the last </li> tag in the document, it matches everything up to the first </li> tag. Finally, the /g modifier says to repeat this match as many times as possible, and return the contents of the capturing () as a list.
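As a rough end-to-end sketch, the extraction and the comma step can be chained as follows. The two-entry `$doc` is made up, and the attribute part is written as `[^>]*` so that attributes of any length match, as described above.

```perl
use strict;
use warnings;

# Made-up two-entry document; real input would be the whole page in $doc.
my $doc = '<ul><li value="2">First Author and Second Author, <b>"Title A"</b>.</li>'
        . "\n"
        . '<li>Third Author, <b>"Title B"</b>.</li></ul>';

# List context + /g: one captured string per <li>...</li> pair.
my @entries = ($doc =~ m{ <li (?: \s [^>]* )? > (.*?) </li> }xmsg);

# Each entry already starts at the author text, so only the comma step remains.
my @authors = map { my $e = $_; $e =~ s{ , .* }{}xms; $e } @entries;

print "$_\n" for @authors;
```

The copy into `$e` inside the `map` keeps `@entries` unmodified, since `$_` is an alias to the original element.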

Upvotes: 1

Miguel Prz

Reputation: 13792

My recommendation: don't use regular expressions. Instead, use HTML::Parser or one of the many other HTML-parsing modules available on CPAN.
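A minimal sketch of what that could look like with HTML::Parser's version 3 handler API. It assumes the module is installed from CPAN, that the input uses ordinary tags written as `<li>`, and that the authors are the first text after each `<li>`; the helper name `li_leading_text` and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, not part of core Perl

# Hypothetical helper: collect the first text run following each <li> start tag.
sub li_leading_text {
    my ($html) = @_;
    my @texts;
    my $want = 0;   # true right after an <li> start tag
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { $want = ($_[0] eq 'li') }, 'tagname' ],
        text_h  => [ sub {
            return unless $want && $_[0] =~ /\S/;   # skip whitespace-only text
            push @texts, $_[0];
            $want = 0;
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);   # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;
    return @texts;
}

my $html = '<ul><li value="2">First Author and Second Author, <b>"Title A"</b>.</li>'
         . '<li>Third Author, Fourth Author, and Fifth Author, <b>"Title B"</b>.</li></ul>';

# The text run ends just before <b>, so only the trailing comma has to go.
my @authors = map { my $t = $_; $t =~ s/,\s*\z//; $t } li_leading_text($html);
print "$_\n" for @authors;
```

Because the parser stops the text run at the `<b>` tag, an author list that contains commas survives intact here.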

Upvotes: 1
