Reputation: 1003
So I have this file, clip.txt, that contains only:
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>,
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>
Now I would like to remove everything between <...> so that I end up with
Kanye West, Chris Martin.
With Perl, I currently have this code:
#!/usr/local/bin/perl
$file = 'clip.txt';
open(FILE, $file);
@lines = <FILE>;
close(FILE);
$line = @lines[0];
while (index($line, "<") != -1) {
    my $from = rindex($line, "<");
    my $to = rindex($line, ">");
    print $from;
    print ' - ';
    print $to;
    print ' ';
    print substr($line, $from, $to+1);
    print '|'; # to see where the line stops
    print "\n";
    substr($line, $from, $to+1) = ""; # removes between lines
    $counter += 1;
}
print $line;
All the "print" lines are redundant, but they are good for debugging.
Now the result is:
138 - 141 </a>
|
67 - 125 <a href="http://http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin|
61 - 64 </a>, |
0 - 50 <a href="https://en.wikipedia.org/wiki/Kanye_West">|
Kanye West
First the script finds the positions 138 - 141 and removes that range. Then it finds 67 - 125, but it removes 67 - 137. Next it finds 61 - 64, but it removes 61 - 66.
Why does it do this? On the last line it finds 0 - 50 and removes that range perfectly, so I cannot find the logic here.
Upvotes: 0
Views: 1035
Reputation: 98398
substr's third parameter is length, not ending index, so you should pass $to-$from+1.
(Though you should also adjust your code to make sure it finds both a < and a >, and that the > is after the <.)
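As a minimal sketch of both points (passing a length to substr, and checking that a > really follows the <), the loop could look something like the following. The helper name strip_tags is made up for illustration and is not from the question; the sketch also assumes it is acceptable to stop at the first < that has no closing >.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name, not from the original post.
# It removes every <...> span from a string using rindex/index/substr.
sub strip_tags {
    my ($line) = @_;
    while ((my $from = rindex($line, '<')) != -1) {
        # Search for '>' starting at $from, so it cannot precede the '<'.
        my $to = index($line, '>', $from);
        last if $to == -1;    # unmatched '<': stop rather than loop forever
        # The third argument is a length, so convert the end index to a count.
        substr($line, $from, $to - $from + 1) = '';
    }
    return $line;
}

print strip_tags('<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>,'), "\n";
```

For the first line of clip.txt this prints "Kanye West,". Searching for the > from $from onward (instead of using rindex for it) is what guarantees the > comes after the <.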
Upvotes: 4
Reputation: 2668
While a simple regex substitution should do what you want on the example data, parsing (X)HTML with regexes is generally a bad idea (and a simple character search has the same weaknesses). A more flexible and more readable approach is to use a proper HTML parser.
Example with Mojo::DOM:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
# slurp data into a parser object
my $dom = Mojo::DOM->new(do { local $/; <DATA> });
# iterate all links
for my $link ($dom->find('a')->each) {
    # print the link text
    say $link->text;
}
__DATA__
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>,
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>
Output:
Kanye West
Chris Martin
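For comparison, the simple regex substitution mentioned at the start of this answer might look roughly like this. It is only a sketch for the example data: the helper name strip_tags_regex is made up for illustration, and the pattern assumes that no > ever appears inside a tag (for instance in an attribute value or a comment).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# strip_tags_regex is a hypothetical helper name used for illustration.
sub strip_tags_regex {
    my ($text) = @_;
    # Delete each '<', everything up to the next '>', and that '>' itself.
    $text =~ s/<[^>]*>//g;
    return $text;
}

my $html = <<'END';
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>,
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>
END

print strip_tags_regex($html);
```

On this input it prints "Kanye West," and "Chris Martin" on separate lines. Unlike the parser version, it keeps the text between the links (the comma) and will misbehave on markup that breaks the assumption above.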
Upvotes: 3
Reputation: 118128
The proper solution is indeed to use something like HTML::TokeParser::Simple. However, if you are just doing this as a learning exercise, you can simplify it by extracting what you want rather than removing what you don't:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
while (my $line = <DATA>) {
    my $x = index $line, '>';
    next unless ++$x;
    my $y = index $line, '<', $x;
    next unless $y >= 0;
    say substr($line, $x, $y - $x);
}
__DATA__
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>,
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>
Output:
Kanye West
Chris Martin
On the other hand, using an HTML parser isn't really that complicated:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
while (my $anchor = $parser->get_tag('a')) {
    my $text = $parser->get_text('/a');
    say $text;
}
__DATA__
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>,
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>
Upvotes: 3