tipssch

Reputation: 59

Perl regex - need to ignore first two instances of a character

I need to extract the value of the message attribute from the following string, i.e. I want to get: The String "test" appears 4 times in the file.

severity="warning" message="The String "test" appears 4 times in the file." source="com.puppycrawl.tools.checkstyle.checks.coding.MultipleStringLiteralsCheck"

I've tried the regular expression message="([^"]*)", but it stops at the first inner " it encounters, so only The String is returned.

Is there a way to ignore the inner quotes within the message attribute and capture the entire attribute?

Upvotes: 1

Views: 143

Answers (3)

Borodin

Reputation: 126722

This solution keeps fetching characters from the string until a new label like source= is encountered. All parameter values are stored in the hash %params, so the value for message is just $params{message}.

I've used Data::Dump only to display the complete hash contents once the string has been parsed.

use strict;
use warnings 'all';
use feature 'say';

my $str = 'severity="warning" message="The String "test" appears 4 times in the file." source="com.puppycrawl.tools.checkstyle.checks.coding.MultipleStringLiteralsCheck"';

my %params;

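# Match key="value" pairs. The value may contain quotes, but no character
# in it may be directly followed by the start of the next key= label.
while ( $str =~ / (\w+) \s* = \s* " ( (?: . (?! \w+ \s* = ) )* ) " /gsx ) {
    $params{$1} = $2;
}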

say $params{message};

use Data::Dump;
dd \%params;

output

The String "test" appears 4 times in the file.
{
  message  => "The String \"test\" appears 4 times in the file.",
  severity => "warning",
  source   => "com.puppycrawl.tools.checkstyle.checks.coding.MultipleStringLiteralsCheck",
}

Upvotes: 1

Wiktor Stribiżew

Reputation: 626747

If we can assume that the key always consists of alphanumeric or underscore characters (\w+) and is followed by =, and that the values do not contain that pattern, you can use a lazily quantified dot, .*?, and check the trailing boundary with a positive lookahead. Thus, as a quick-and-dirty one-time fix, you can use

message="(.*?)"(?=\s+\w+=|$)

See the regex demo

Note that . does not match newline characters by default; if the message can span several lines, you will need the /s modifier.
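As a minimal sketch of how this pattern could be applied in Perl (the variable names $str and $message are only illustrative, and the input is the sample string from the question):

```perl
use strict;
use warnings;

# Sample input copied from the question.
my $str = 'severity="warning" message="The String "test" appears 4 times in the file." source="com.puppycrawl.tools.checkstyle.checks.coding.MultipleStringLiteralsCheck"';

# Lazily capture everything after message=" up to the first quote that is
# followed either by whitespace and the next key= label or by the end of
# the string. The /s modifier lets . match newlines too.
my ($message) = $str =~ /message="(.*?)"(?=\s+\w+=|$)/s;

print defined $message ? "$message\n" : "no match\n";
```

The inner quotes around test are skipped because neither of them is followed by whitespace plus a key= label, so the lookahead only succeeds at the real closing quote.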

The input you have needs fixing by all means.

Upvotes: 1

Olaf Dietsche

Reputation: 74018

If the attributes are always in this order, i.e. source follows message, you might try to make it a bit more robust by anchoring on the attribute that follows:

message="(.*?)"\s+source="

This will break, of course, if a quote followed by source=" occurs inside the message.
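A short sketch of both cases in Perl, assuming the sample string from the question. The second input, $broken, is a made-up string used only to show the failure mode, and all variable names are illustrative:

```perl
use strict;
use warnings;

# Sample input copied from the question.
my $str = 'severity="warning" message="The String "test" appears 4 times in the file." source="com.puppycrawl.tools.checkstyle.checks.coding.MultipleStringLiteralsCheck"';

# The capture ends at the first quote that is followed by whitespace and source=".
my ($anchored) = $str =~ /message="(.*?)"\s+source="/s;
print "$anchored\n";

# Hypothetical input whose message itself contains: " source="
my $broken = 'message="a " source="b" source="x"';

# The lazy match stops at the first occurrence, so the message is cut short.
my ($truncated) = $broken =~ /message="(.*?)"\s+source="/s;
print "[$truncated]\n";
```

With the question's input the whole message is captured, while for the made-up input only the text before the first embedded " source=" is returned.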

Upvotes: 1
