Reputation: 415
I am new to regex. I want to convert each of the following input cases to the expected output.
Input:
CASE 1 :
<sec id="S_4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>)</italic>.</title>
CASE 2 :
<sec id="S_4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>).</italic></title><br>
CASE 3 :
<sec id="S_4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>)<bold>.</bold></italic></title>
Expected output:
<sec id="S_4"><label>2.2.6.4.</label><title> Content abc (<bold>15</bold>)</title>
I want to remove the punctuation at the end of the title and also remove the formatting tags that wrap the title.
Please provide a regex for this.
Thanks in advance.
I tried this code, but I can't get any further:
while($cnt =~m{<sec( [^>]*)?><label( [^>]+)?>(.*?)</label>)(.*?)(<title( [^>]*)?>)(.*?)</title>)}ig){
my $temp = $5;
$temp = ~s{<title( [^>]*)?>)(.*?)</title>}{}ig;
}
Upvotes: 0
Views: 116
Reputation: 35198
Welcome to regular expressions. They are a powerful tool, but I would strongly advise you to use an actual XML or HTML Parser if that is what your data is.
At minimum, you should use the /x modifier so that you can add whitespace to the left-hand side (the pattern) of your regular expressions. I removed a number of redundant groupings and did some other cleanup on yours:
use strict;
use warnings;
while (my $line = <DATA>) {
    chomp $line;
    $line =~ s{
        (
            <sec\b[^>]*>
            \s*
            <label\b[^>]*>
            (?:(?!</?label\b).)*
            </label>
            (?:(?!<title\b).)*    # This assumes a <title> under a <sec> (not good)
            <title\b[^>]*>
        )
        (
            (?:(?!</?title\b).)*
        )
        </title>\s*
    }{
        my $pre = $1;
        my $title = $2;
        1 while $title =~ s{
            \A
            ([\s\p{Punct}]*)
            <(\w+)> (.*) </\2>
            ([\s\p{Punct}]*)
            \z
        }{$1$3$4}isgx;
        $title =~ s{<(bold|italic)>[.]+</\1>\z}{}i;
        $title =~ s{[.]+\z}{};
        "$pre$title</title>"
    }isgex;
    print $line, "\n";
}
__DATA__
<sec id="S_4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>)</italic>.</title>
<sec id="S_4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>).</italic></title>
<sec id="S_4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>)<bold>.</bold></italic></title>
Outputs:
<sec id="S_4"><label>2.2.6.4.</label><title> Content abc (<bold>15</bold>)</title>
<sec id="S_4"><label>2.2.6.4.</label><title> Content abc (<bold>15</bold>)</title>
<sec id="S_4"><label>2.2.6.4.</label><title> Content abc (<bold>15</bold>)</title>
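The title-cleaning step can also be pulled out into a small subroutine, which makes it easier to test on its own. The following is only a sketch: the name `clean_title` is hypothetical, and it assumes that the only formatting tags are `<bold>` and `<italic>`, that the only trailing punctuation to drop is a period, and that a tag is never nested inside another tag of the same name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code above): takes the text found
# between <title> and </title> and returns it without wrapping
# <bold>/<italic> tags and without trailing periods.
sub clean_title {
    my ($title) = @_;

    # Repeat the three rules until none of them changes the string.
    my $changed = 1;
    while ($changed) {
        $changed = 0;

        # Rule 1: drop periods (and surrounding whitespace) at the very end.
        $changed = 1 if $title =~ s{\s*[.]+\s*\z}{};

        # Rule 2: drop a trailing tag pair that holds only periods or
        # whitespace, e.g. "<bold>.</bold>".
        $changed = 1 if $title =~ s{<(bold|italic)>\s*[.]*\s*</\1>\s*\z}{}i;

        # Rule 3: unwrap a single tag pair that encloses the whole title.
        # The inner part may not contain the same tag again, so
        # "<bold>a</bold> and <bold>b</bold>" is left alone.
        $changed = 1 if $title =~ s{
            \A (\s*)
            <(bold|italic)>
            ( (?: (?! </?\2\b ) . )* )
            </\2>
            \s* \z
        }{$1$3}isx;
    }
    return $title;
}

print clean_title('<italic> Content abc (<bold>15</bold>).</italic>'), "\n";
```

Inside the `/e` replacement block of the outer substitution, the title would then be built with `my $title = clean_title($2);`.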
Upvotes: 0
Reputation: 156
$clean =~ s{(<sec(?: [^>]+)?>(?:\s*<label(?: [^>]+)?>(?:(?!</?label[ >]).)*</label>)(?:(?!<title[ >]).)*<title(?: [^>]+)?>)(((?:(?!</?title[ >]).)*))</title>\s*}{
my $pre = $1;
my $title = $2;
$title =~ s{((<(bold|italic)>)?((?:(?!</?\1>).)*)(</\3>))(<(bold|italic)>)?([\.])?$}{
my $pre = $2;
my $cnt = $4;
my $post = $5;
$cnt =~s{(<(bold|italic)>)?[\.](</\2>)$}{}ig;
$cnt =~s{[\.]$}{}ig;
qq($pre$cnt$post)
}igse;
qq($pre$title</title>)
}isge;
Try this code; it might help. It is written inline: the substitutions that clean the title are nested inside the replacement block of the outer substitution.
Upvotes: 1