Mattis Asp

Reputation: 1003

perl substr remove everything between two positions in string

So I have a file, clip.txt, that contains only:

<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, 
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>

Now I would like to remove everything between < and > (the tags themselves) so that I end up with

Kanye West, Chris Martin

With Perl, I have the following code:

#!/usr/local/bin/perl

$file = 'clip.txt';
open(FILE, $file);
@lines = <FILE>;
close(FILE);
$line =  @lines[0];

while (index($line, "<") != -1) {
my $from = rindex($line, "<");
my $to = rindex($line, ">");

print $from;
print ' - ';
print $to;
print ' ';

print substr($line, $from, $to+1);
print '|'; # to see where the line stops
print "\n";
substr($line, $from, $to+1) = ""; # removes the text between the two positions
$counter += 1;

}

print $line;

All the print lines are redundant, but useful for debugging.

Now the result is:

138 - 141 </a>
|
67 - 125 <a href="http://http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin|
61 - 64 </a>, |
0 - 50 <a href="https://en.wikipedia.org/wiki/Kanye_West">|
Kanye West

First the script finds positions 138 - 141 and removes that span. Then it finds 67 - 125, but it removes 67 - 137. Next it finds 61 - 64, but it removes 61 - 66.

Why does it do this? On the last line it finds 0 - 50 and removes exactly that span. So I cannot find the logic here.

Upvotes: 0

Views: 1035

Answers (4)

ysth

Reputation: 98398

substr's third parameter is length, not ending index, so you should pass $to-$from+1.

(Though you should also adjust your code to make sure it finds both a < and a >, and that the > is after the <.)
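A minimal sketch of the loop with both fixes applied, assuming the whole input is in one string. The sub name strip_tags is made up for illustration; the search for > starts at the position of the < so the closing bracket can never come before the opening one:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): removes every <...> span.
sub strip_tags {
    my ($line) = @_;
    while (1) {
        # Position of the last '<' in the string
        my $from = rindex($line, '<');
        last if $from == -1;

        # First '>' at or after that '<'; stop if there is none
        my $to = index($line, '>', $from);
        last if $to == -1;

        # Third argument is a LENGTH, so convert the end index
        substr($line, $from, $to - $from + 1) = '';
    }
    return $line;
}

print strip_tags('<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, <a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>'), "\n";
```

With from = 0 the length $to+1 happens to equal $to-$from+1, which is why the last removal in the question looked correct.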

Upvotes: 4

memowe

Reputation: 2668

While a simple regex substitution should do what you want on the example data, parsing (X)HTML with regexes is generally a bad idea (and doing it with a simple character search amounts to the same thing). A more flexible and more readable approach is to use a proper HTML parser.

Example with Mojo::DOM:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# slurp data into a parser object
my $dom = Mojo::DOM->new(do { local $/; <DATA> });

# iterate all links
for my $link ($dom->find('a')->each) {

    # print the link text
    say $link->text;
}

__DATA__
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, 
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>

Output:

Kanye West
Chris Martin

Upvotes: 3

Sinan Ünür

Reputation: 118128

The proper solution is indeed to use something like HTML::TokeParser::Simple. However, if you are just doing this as a learning exercise, you can simplify it by extracting what you want rather than removing what you don't:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';

while (my $line = <DATA>) {
    my $x = index $line, '>';
    next unless ++$x;
    my $y = index $line, '<', $x;
    next unless $y >= 0;
    say substr($line, $x, $y - $x);
}

__DATA__
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>,
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>

Output:

Kanye West
Chris Martin

On the other hand, using an HTML parser isn't really that complicated:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while (my $anchor = $parser->get_tag('a')) {
    my $text = $parser->get_text('/a');
    say $text;
}

__DATA__
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>,
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>

Upvotes: 3

choroba

Reputation: 241878

You can use the s/// operator:

$line =~ s/<[^>]+>//g;
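As a rough sketch of that substitution on the sample data (the inline string and the $text variable are only for illustration; copying into $text leaves the original line untouched):

```perl
use strict;
use warnings;

# Sample input from the question, joined onto one line
my $line = '<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, <a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>';

# Copy $line into $text, then delete every "<", run of non-">", ">" sequence
(my $text = $line) =~ s/<[^>]+>//g;

print "$text\n";
```

Note that [^>]+ stops at the first >, so this only holds up as long as no attribute value contains a literal > character.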

Upvotes: 4
