Reputation: 3992
Input file:
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
I would like to match the tag <TD><PRE> sample</PRE></TD>
and, if it matches, get the contents of the previous tag, which is <TD>This is a TD cell</TD>
Output:
This is a TD cell
I tried with the below code:
my $Output = m/<TD.*?\/TD>/;
I am able to match the tag, but I am unable to get the contents of the previous tag from the same match. Can anyone help me out with this? Thanks in advance.
Upvotes: 1
Views: 191
Reputation: 20280
Since you need to go backwards, I think you will probably need to build a full tree. Normally I recommend a DOM-style HTML parser (see Mojo::DOM), but for building a tree, try something like HTML::Tree.
EDIT:
So I decided to see if I could do this with Mojo::DOM, and it worked rather nicely:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.10.0;
use Mojo::DOM;
my $dom = Mojo::DOM->new->xml(1)->parse(<<'HTML');
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
HTML
my $collection = $dom->find('TR TD');
my $i = -1; # so that first increment makes 0
$collection->first(sub{$i++; /sample/;});
say $collection->[$i-1];
You have to force XML parsing because in HTML mode Mojo::DOM lowercases tag names, so the upper-case selector 'TR TD' would not match. The rest should be self-explanatory.
Edit Nov 1, 2012
Mojolicious 3.54 was just released, and it gave Mojo::DOM the new next and previous methods, which help here. (I used this post as a case example for their use.) That means you can now do:
say $dom->find('TR TD')->first(qr/sample/)->previous;
rather than the last 4 lines of the example above.
Upvotes: 1
Reputation: 6204
Although we're often cautioned against writing our own HTML regexes instead of using mature HTML parsers, sometimes the former may do the job. See if this option helps (and you may want to match a little more of the <PRE> tag):
use Modern::Perl;
my $html = <<'html';
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
html
say $html =~ m|<TD>(.*?)</TD>.*<TD><PRE>|is;
Output:
This is a TD cell
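One thing to watch with the pattern above: under /s the `.*` can span any amount of markup, so it returns the first <TD> in the document, not necessarily the cell directly before the <PRE> cell. Below is a minimal sketch of a tighter variant. It assumes the two cells are separated only by whitespace and that the first cell holds plain text with no attributes or nested tags; the sub name `previous_cell` and the extra first row in the sample are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the <TD> cell that sits directly
# before a <TD><PRE> cell, or undef when there is no such pair.
sub previous_cell {
    my ($html) = @_;
    # [^<]* keeps the capture inside a single cell; \s* allows only
    # whitespace between the two cells, so earlier cells cannot match.
    my ($text) = $html =~ m{<TD>([^<]*)</TD>\s*<TD><PRE>}i;
    return $text;
}

my $sample = <<'HTML';
<TR>
<TD>first row</TD>
<TH>header</TH>
</TR>
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
</TR>
HTML

print previous_cell($sample), "\n";    # prints "This is a TD cell"
```

With the original `.*` version, the same input would yield "first row" instead.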
Upvotes: 0
Reputation: 4063
You can use lookbehind and lookahead to assert that a piece of text is preceded or followed by another. Lookarounds are zero-width assertions, which means that they do not consume any characters, so the text they check is not part of the match:
(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)
which means:
1. (?<=TD>) - look behind from the position where you are and assert that the preceding text is TD> (the end of a TD tag)
2. [^>]+ - match everything that is not the end of a tag
3. (?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>) - and look ahead from the position where you are and assert that the following text is </TD>\s*<TD><PRE>\s*sample</PRE></TD> (closing tag, optional whitespace characters and your condition)

The result of this match is the text matched by #2.
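As a rough sketch, this is how the lookaround expression could be applied in Perl to the first row of the question's table. The sub name `cell_before_sample` is invented for this example; the pattern itself is the one given above.

```perl
use strict;
use warnings;

# The lookbehind and lookahead only assert context; the match itself
# is just the text of the cell in between.
my $pattern = qr{(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)};

# Hypothetical helper: return the matched cell text, or undef on no match.
sub cell_before_sample {
    my ($html) = @_;
    my ($match) = $html =~ /($pattern)/;
    return $match;
}

my $row = <<'HTML';
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
HTML

my $found = cell_before_sample($row);
print "$found\n" if defined $found;    # prints "This is a TD cell"
```

The extra parentheses around `$pattern` are there only so that the match returns the matched text in list context rather than a true/false value.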
Upvotes: 0
Reputation: 20270
This isn't really a good problem for regex. The best you can do with a single expression is to match both cells and capture the contents of the first cell in a group. e.g.
<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>
I guess you'd need to replace whatever <PRE> sample</PRE> would be with another expression, but you haven't provided enough information about that here.
Using an HTML parser which can actually traverse the document tree would be a better option if you need to do this more generically.
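For illustration only, here is one way the capture-group idea could be parameterised when the content of the second cell is known literal text. The sub name `cells_before` and the sample rows are assumptions, and the sketch further assumes the first cell contains plain text with no nested tags.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <TD> cell that is
# immediately followed by a <TD> cell whose content equals $marker.
sub cells_before {
    my ($html, $marker) = @_;
    # \Q...\E quotes regex metacharacters so $marker is matched literally;
    # <TD\b[^>]*> also accepts cells that carry attributes.
    return $html =~ m{<TD\b[^>]*>([^<]*)</TD>\s*<TD\b[^>]*>\Q$marker\E</TD>}g;
}

my $rows = <<'HTML';
<TR><TD>one</TD> <TD><PRE> sample</PRE></TD></TR>
<TR><TD ALIGN="LEFT">two</TD>
<TD><PRE> sample</PRE></TD></TR>
<TR><TD>three</TD><TD>other</TD></TR>
HTML

my @cells = cells_before($rows, '<PRE> sample</PRE>');
print join(', ', @cells), "\n";    # prints "one, two"
```

In list context with /g, the match returns the captured group from every match, which is why the sub can return the match expression directly.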
Upvotes: 0