Anil

Reputation: 3992

How to match a particular tag value and then get the result from the previous tag?

Input file:

<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>

I would like to match the tag <TD><PRE> sample</PRE></TD> and, if it matches, get the content of the previous tag, which is <TD>This is a TD cell</TD>

Output:

This is a TD cell

I tried with the below code:

my $Output = m/<TD.*?\/TD>/;

I am able to match the tag, but I am unable to get the result from the previous tag once it matches. Can anyone help me out with this? Thanks in advance.

Upvotes: 1

Views: 191

Answers (4)

Joel Berger

Reputation: 20280

Since you need to go backwards, you will probably have to build a full tree. Normally I recommend a DOM-style HTML parser (see Mojo::DOM), but for building a tree, try something like HTML::Tree.

EDIT:

So I decided to see if I could do this with Mojo::DOM, and it worked rather nicely:

#!/usr/bin/env perl

use strict;
use warnings;

use 5.10.0;
use Mojo::DOM;

my $dom = Mojo::DOM->new->xml(1)->parse(<<'HTML');
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
HTML

my $collection = $dom->find('TR TD');
my $i = -1; # so that first increment makes 0
$collection->first(sub{$i++; /sample/;});
say $collection->[$i-1];

You have to force XML parsing since HTML5 doesn't use upper-case tags, but the rest should be self-explanatory.

Edit Nov 1, 2012

Mojolicious 3.54 was just released and it gave Mojo::DOM the new next and previous methods, which help here. (I used this post as a case example for their use). That means, now you can do:

say $dom->find('TR TD')->first(qr/sample/)->previous;

rather than the last 4 lines of the example above.

Upvotes: 1

Kenosis

Reputation: 6204

Although we're often cautioned against writing our own HTML regexes instead of using mature HTML parsers, sometimes the former may do the job. See if this option helps (and you may want to match a little more of the <PRE> tag):

use Modern::Perl;

my $html = <<'html';
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
html

say $html =~ m|<TD>(.*?)</TD>.*<TD><PRE>|is;

Output:

This is a TD cell

Upvotes: 0

Joanna Derks

Reputation: 4063

You can use lookbehind and lookahead to assert that a piece of text is preceded or followed by another. Lookarounds are zero-width assertions, which means they do not consume any characters, so the text they check is not included in the match:

(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)

which means:

  1. (?<=TD>) - look behind from the current position and assert that the preceding text is TD>, i.e. the end of a TD tag
  2. [^>]+ - match one or more characters that are not >, the character that ends a tag
  3. (?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>) - look ahead from the current position and assert that the following text is </TD>\s*<TD><PRE>\s*sample</PRE></TD> (closing tag, optional whitespace characters, and your condition)

The result of this match is the text matched by #2.
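
As a rough sketch of how this pattern could be used from Perl: the sub name text_before_sample_cell is made up for illustration, and the /p modifier with ${^MATCH} assumes Perl 5.10 or later. Since the lookarounds consume nothing, the whole match is already the cell text, so no capture group is needed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): returns the text that sits
# between "TD>" and the "</TD>" directly before the <PRE> sample cell,
# or undef when the pattern does not match.
sub text_before_sample_cell {
    my ($html) = @_;

    # /p keeps the matched text available in ${^MATCH} (Perl 5.10+).
    return unless $html =~ m{(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)}p;
    return ${^MATCH};
}

my $input = <<'HTML';
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
</TABLE>
HTML

my $found = text_before_sample_cell($input);
print defined $found ? "$found\n" : "no match\n";   # prints "This is a TD cell"
```

Note that this only works when the two cells are separated by nothing but whitespace, and when the first cell contains plain text with no nested tags.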

Upvotes: 0

beerbajay

Reputation: 20270

This isn't really a good problem for regex. The best you can do with a single expression is to match both cells and capture the contents of the first cell in a group. For example:

<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>

I guess you'd need to replace whatever <PRE> sample</PRE> would be with another expression, but you haven't provided enough information about that here.
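
A minimal sketch of that idea in Perl, with the second cell's content passed in as a parameter. The sub name previous_cell_text and its arguments are assumptions for illustration. It also swaps (.*?) for ([^<]*), on the assumption that the first cell holds plain text only, so the capture cannot run across several cells.

```perl
use strict;
use warnings;

# Hypothetical helper: $target is the literal content of the cell to look
# for; the return value is the text of the <TD> cell directly before it,
# or undef when no such pair of adjacent cells exists.
sub previous_cell_text {
    my ($html, $target) = @_;

    # \Q...\E quotes $target so it is matched literally, not as a pattern.
    # [^<]* stops at the first tag, so the capture stays inside one cell.
    my ($text) = $html =~ m{<TD>([^<]*)</TD>\s*<TD>\Q$target\E</TD>};
    return $text;
}

my $input = <<'HTML';
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
HTML

my $found = previous_cell_text($input, '<PRE> sample</PRE>');
print defined $found ? "$found\n" : "no match\n";   # prints "This is a TD cell"
```

If the target cell may differ in whitespace or attributes, a pattern could be passed instead of a literal string, at the cost of having to escape it yourself.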

Using an HTML parser which can actually traverse the document tree would be a better option if you need to do this more generically.

Upvotes: 0
