Reputation: 10460
I am trying to write a Perl script that will change a line that looks like this ...
<li><em>01 – Chapters 1-4</em> – 00:14:36 <br />
... and make it look like this ...
01 – Chapters 1-4
... no big deal, right? I just do the \(.*\) thing in my Perl script like so:
#!/usr/bin/perl -w
use strict;
while(<DATA>) {
    my $line = $_;
    chomp($line);
    if ( $line =~ /^<li>/ ) {
        $line =~ s/<em>\(.*\)<\/em>/$1/g;
        print "[" . $line . "]\n";
    }
}
__DATA__
<li><em>01 – Chapters 1-4</em> – 00:14:36 <br />
<li><em>02 – Chapters 5-8</em> – 00:10:52 <br />
<li><em>03 – Chapters 9-14</em> – 00:19:16 <br />
<li><em>04 – Chapters 15-18</em> – 00:13:30 <br />
<li><em>05 – Chapters 19-22</em> – 00:17:01 <br />
<li><em>06 – Chapters 23-25</em> – 00:16:44 <br />
<li><em>07 – Chapter 26</em> – 00:10:35 <br />
When I run the script I get this output ...
red@ubuntu:~/scripts$ ./test.pl
[<li><em>01 – Chapters 1-4</em> – 00:14:36 <br />]
[<li><em>02 – Chapters 5-8</em> – 00:10:52 <br />]
[<li><em>03 – Chapters 9-14</em> – 00:19:16 <br />]
[<li><em>04 – Chapters 15-18</em> – 00:13:30 <br />]
[<li><em>05 – Chapters 19-22</em> – 00:17:01 <br />]
[<li><em>06 – Chapters 23-25</em> – 00:16:44 <br />]
[<li><em>07 – Chapter 26</em> – 00:10:35 <br />]
... what am I doing wrong here?
Thanks
UPDATE:
Thanks for all your replies. They are very helpful. I have changed my code to this ...
red@ubuntu:~/scripts$ cat test.pl
#!/usr/bin/perl -w
use strict;
while(<DATA>) {
    my $line = $_;
    chomp($line);
    if ( $line =~ /^<li>/ ) {
        $line =~ s/<em>(.*)<\/em>/$1/g;
        print "[" . $line . "]\n";
    }
}
__DATA__
<li><em>01 – Chapters 1-4</em> – 00:14:36 <br />
<li><em>02 – Chapters 5-8</em> – 00:10:52 <br />
<li><em>03 – Chapters 9-14</em> – 00:19:16 <br />
<li><em>04 – Chapters 15-18</em> – 00:13:30 <br />
<li><em>05 – Chapters 19-22</em> – 00:17:01 <br />
<li><em>06 – Chapters 23-25</em> – 00:16:44 <br />
<li><em>07 – Chapter 26</em> – 00:10:35 <br />
... but it still does not produce the output I want. I get this instead ...
red@ubuntu:~/scripts$ ./test.pl
[<li>01 – Chapters 1-4 – 00:14:36 <br />]
[<li>02 – Chapters 5-8 – 00:10:52 <br />]
[<li>03 – Chapters 9-14 – 00:19:16 <br />]
[<li>04 – Chapters 15-18 – 00:13:30 <br />]
[<li>05 – Chapters 19-22 – 00:17:01 <br />]
[<li>06 – Chapters 23-25 – 00:16:44 <br />]
[<li>07 – Chapter 26 – 00:10:35 <br />]
... looks like the <em> and </em> got removed, but I just want the text between the <em> and </em>.
Upvotes: 1
Views: 111
Reputation: 126722
All you are doing is removing the <em> tags from around the first part of the string. If you want to remove everything else as well, write this:
use strict;
use warnings;
while(<DATA>) {
    print "[$1]\n" if /^<li><em>([^<>]+)/;
}
__DATA__
<li><em>01 – Chapters 1-4</em> – 00:14:36 <br />
<li><em>02 – Chapters 5-8</em> – 00:10:52 <br />
<li><em>03 – Chapters 9-14</em> – 00:19:16 <br />
<li><em>04 – Chapters 15-18</em> – 00:13:30 <br />
<li><em>05 – Chapters 19-22</em> – 00:17:01 <br />
<li><em>06 – Chapters 23-25</em> – 00:16:44 <br />
<li><em>07 – Chapter 26</em> – 00:10:35 <br />
output
[01 – Chapters 1-4]
[02 – Chapters 5-8]
[03 – Chapters 9-14]
[04 – Chapters 15-18]
[05 – Chapters 19-22]
[06 – Chapters 23-25]
[07 – Chapter 26]
Upvotes: 2
Reputation: 14038
Your first and second attempts include the following:
$line =~ s/<em>\(.*\)<\/em>/$1/g; # First version
$line =~ s/<em>(.*)<\/em>/$1/g; # Second version
Neither version makes any alteration to the right-hand end of the lines. The command s/f/r/ says to search for something matching f and replace that part with r; implicitly, it does nothing to the rest of the string.
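A minimal sketch of that point, assuming a sample line with plain hyphens in place of the original dashes. The helper name strip_em_tags is hypothetical and only wraps the second version of the substitution so its result can be inspected:

```perl
use strict;
use warnings;

# Hypothetical helper: apply the second-version substitution to a copy
# of the line and return the result.
sub strip_em_tags {
    my ($line) = @_;
    # Only the text matched by <em>...</em> is replaced by $1;
    # everything before and after the match is left as it was.
    $line =~ s/<em>(.*)<\/em>/$1/;
    return $line;
}

my $sample = '<li><em>07 - Chapter 26</em> - 00:10:35 <br />';
print strip_em_tags($sample), "\n";
# prints: <li>07 - Chapter 26 - 00:10:35 <br />
```

The <li> prefix and the time suffix survive because they were never part of the match.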
Writing the command as
$line =~ s/<em>(.*)<\/em>.*/$1/g;
says to find (after the em>
) any number of characters up to but not including end-of-line or newline. So the command will strip off the other characters as wanted.
The s/// command may use other characters as the delimiter, which can make searching for strings that include a / easier. So the above might be more clearly written as
$line =~ s!<em>(.*)</em>.*!$1!g;
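To check what this substitution leaves behind, here is a small sketch using the ! delimiter. The helper names strip_tail and keep_only_em_text are hypothetical, the sample line uses plain hyphens, and the second pattern (with a leading ^.*) is an illustrative variation rather than something taken from the answer above:

```perl
use strict;
use warnings;

# Hypothetical helper: the substitution as written above, with ! delimiters.
# It removes the tags and everything after </em>.
sub strip_tail {
    my ($line) = @_;
    $line =~ s!<em>(.*)</em>.*!$1!;
    return $line;
}

# Hypothetical variation: a leading ^.* also consumes whatever precedes <em>,
# so only the captured text remains.
sub keep_only_em_text {
    my ($line) = @_;
    $line =~ s!^.*<em>(.*)</em>.*!$1!;
    return $line;
}

my $sample = '<li><em>07 - Chapter 26</em> - 00:10:35 <br />';
print strip_tail($sample), "\n";         # prints: <li>07 - Chapter 26
print keep_only_em_text($sample), "\n";  # prints: 07 - Chapter 26
```

With ! as the delimiter, the / in </em> no longer needs a backslash.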
In the example you give there is no need to modify the string. The task described is to print the text within the <em> and </em> pair and discard the rest of the line, so the code in msw's answer does all that is needed. If you were processing a huge amount of text where performance is important, then msw's method may be preferable.
Upvotes: 1
Reputation: 5764
You are using \(.*\), which matches a literal ( and ). Use (.*) to capture the matched text.
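The difference can be seen side by side in a short sketch. The helper names matches and first_capture and the two sample strings are assumptions made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the text matches the pattern, 0 otherwise.
sub matches {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

# Hypothetical helper: the first capture group, or undef on no match.
sub first_capture {
    my ($text, $re) = @_;
    return $text =~ $re ? $1 : undef;
}

my $escaped   = qr/<em>\(.*\)<\/em>/;  # \( and \) match literal parentheses
my $capturing = qr/<em>(.*)<\/em>/;    # ( and ) form capture group $1

my $plain   = '<em>07 - Chapter 26</em>';
my $parened = '<em>(07 - Chapter 26)</em>';

print matches($plain,   $escaped), "\n";        # prints: 0
print matches($parened, $escaped), "\n";        # prints: 1
print first_capture($plain, $capturing), "\n";  # prints: 07 - Chapter 26
```

The escaped form fails on the question's data because no parentheses follow the <em> tag there, so the substitution never fires and the line is printed unchanged.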
Based on your update, you need to use the following:
$line =~ s/<em>(.*)<\/em>(.*)/$1/g;
I strongly recommend you consider incorporating @AndyLester's comment.
Upvotes: 3
Reputation: 43487
You're substituting only the part of the line that matches in your updated version.
print "[$1]\n" if /<em>(.*)<\/em>/;
will give you only that which the (.*) capture group caught. And then you needn't bother with the substitution.
But do be aware of Andy Lester's caution in the comments. This works just fine for your test data, but HTML is notorious for breaking your regexp, especially if you say the magic phrase "but my real HTML data will always be exactly in this form".
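One concrete way such a regexp can break, sketched on an assumed input that is not in the test data: a line with two <em> elements. The helper names em_greedy and em_negated are hypothetical. The second one borrows the [^<>]+ character class from another answer in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: greedy capture, as in the one-liner above.
sub em_greedy {
    my ($line) = @_;
    return $line =~ /<em>(.*)<\/em>/ ? $1 : undef;
}

# Hypothetical helper: capture only characters that cannot start or end a tag.
sub em_negated {
    my ($line) = @_;
    return $line =~ /<em>([^<>]+)<\/em>/ ? $1 : undef;
}

# Assumed input with two <em> elements on one line.
my $line = '<li><em>01</em> and <em>02</em></li>';
print em_greedy($line), "\n";   # prints: 01</em> and <em>02
print em_negated($line), "\n";  # prints: 01
```

The greedy .* runs to the last </em> on the line, so it swallows the tags in between. The negated class avoids that particular failure, though neither pattern copes with nested tags or attributes, which is where an HTML parser becomes the safer choice.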
Upvotes: 6
Reputation: 385565
If you want to capture, you want
(...)
Escaped parens attempt to match parens.
Upvotes: 2