depsai

Reputation: 415

Regular expression request?

I am new to regex. I want to convert each of the following input cases to the expected output.

Input:

CASE 1 :

<sec id="S&#x005F;4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>)</italic>.</title>

CASE 2 :

<sec id="S&#x005F;4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>).</italic></title>

CASE 3 :

<sec id="S&#x005F;4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>)<bold>.</bold></italic></title>


Expected output:

<sec id="S&#x005F;4"><label>2.2.6.4.</label><title> Content abc (<bold>15</bold>)</title>


I want to remove the punctuation at the end of the title and also remove the formatting tags that wrap the title. Please provide a regex for this. Thanks in advance.

I tried this code, but I am unable to get any further:

while($cnt =~m{<sec( [^>]*)?><label( [^>]+)?>(.*?)</label>)(.*?)(<title( [^>]*)?>)(.*?)</title>)}ig){
      my $temp = $5;
      $temp = ~s{<title( [^>]*)?>)(.*?)</title>}{}ig;
}

Upvotes: 0

Views: 116

Answers (2)

Miller

Reputation: 35198

Welcome to regular expressions. They are a powerful tool, but I would strongly advise you to use an actual XML or HTML parser if that is what your data is.

At minimum, you should use the /x modifier so that you can add whitespace and comments to the pattern side of your regular expressions. I removed a number of redundant groupings and cleaned the expressions up in other ways:

use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;

    $line =~ s{
        (
            <sec\b[^>]*>
            \s*
            <label\b[^>]*>
            (?:(?!</?label\b).)*
            </label>
            (?:(?!<title\b).)*      # This assumes a <title> under a <sec> (not good)
            <title\b[^>]*>
        )
        (
            (?:(?!</?title\b).)*
        )
        </title>\s*
    }{
        my $pre = $1;
        my $title = $2;

        1 while $title =~ s{
            \A
            ([\s\p{Punct}]*)
            <(\w+)> (.*) </\2>
            ([\s\p{Punct}]*)
            \z
        }{$1$3$4}isgx;

        $title =~ s{<(bold|italic)>[.]+</\1>\z}{}i;
        $title =~ s{[.]+\z}{};

        "$pre$title</title>"
    }isgex;

    print $line, "\n";
}

__DATA__
<sec id="S&#x005F;4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>)</italic>.</title>
<sec id="S&#x005F;4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>).</italic></title>
<sec id="S&#x005F;4"><label>2.2.6.4.</label><title><italic> Content abc (<bold>15</bold>)<bold>.</bold></italic></title>

Outputs:

<sec id="S&#x005F;4"><label>2.2.6.4.</label><title> Content abc (<bold>15</bold>)</title>
<sec id="S&#x005F;4"><label>2.2.6.4.</label><title> Content abc (<bold>15</bold>)</title>
<sec id="S&#x005F;4"><label>2.2.6.4.</label><title> Content abc (<bold>15</bold>)</title>
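
The title clean-up step can also be tried on its own, outside the outer substitution. The following is only a minimal sketch of that step: the helper name clean_title is hypothetical (it does not appear in the code above), and it assumes the title content has already been extracted as a plain string.

```perl
use strict;
use warnings;

# Hypothetical helper: strips tags that wrap the whole title,
# then drops trailing periods. Assumes $title is the text between
# <title> and </title>.
sub clean_title {
    my ($title) = @_;

    # Peel one wrapping tag pair per pass until none encloses the string.
    1 while $title =~ s{
        \A
        ([\s\p{Punct}]*)      # leading whitespace or punctuation
        <(\w+)> (.*) </\2>    # a tag pair wrapping everything in between
        ([\s\p{Punct}]*)      # trailing whitespace or punctuation
        \z
    }{$1$3$4}isx;

    $title =~ s{<(bold|italic)>[.]+</\1>\z}{}i;   # a trailing tag holding only periods
    $title =~ s{[.]+\z}{};                        # a trailing bare period
    return $title;
}

print clean_title('<italic> Content abc (<bold>15</bold>)</italic>.'), "\n";
print clean_title('<italic> Content abc (<bold>15</bold>).</italic>'), "\n";
print clean_title('<italic> Content abc (<bold>15</bold>)<bold>.</bold></italic>'), "\n";
```

Each of the three sample titles should come out with the same content as the titles in the output listed above. Note that \p{Punct} does not match `<`, which is why the leading class cannot swallow the start of a tag.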

Upvotes: 0

parthi

Reputation: 156

$clean =~ s{(<sec(?: [^>]+)?>(?:\s*<label(?: [^>]+)?>(?:(?!</?label[ >]).)*</label>)(?:(?!<title[ >]).)*<title(?: [^>]+)?>)(((?:(?!</?title[ >]).)*))</title>\s*}{
    my $pre = $1;
    my $title = $2;
    $title =~ s{((<(bold|italic)>)?((?:(?!</?\1>).)*)(</\3>))(<(bold|italic)>)?([\.])?$}{
        my $pre = $2;
        my $cnt = $4;
        my $post = $5;
        $cnt =~s{(<(bold|italic)>)?[\.](</\2>)$}{}ig;
        $cnt =~s{[\.]$}{}ig;
        qq($pre$cnt$post)
    }igse;
    qq($pre$title</title>)
}isge;

Try this code; it might help you. It is written inline, as a single substitution.

Upvotes: 1
