Heithem Mokaddem

Reputation: 13

Perl parsing Text File with regular expression

I have a file whose lines follow one of these two structures:

USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"

or

USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"

I am trying to parse it with Perl to get the values in a form like the following:

1362224754632;00966590832186;580;AAA;L2

Below is the code:

if($Record =~ /USMS (.*?)|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" FORMAT="(.*?)" THRESHOLDID="(.*?)" TEXT="(.*?)"/)
{
    print LOGFILE "$1;$2;$3;$4;$5;$6;$7\n";
}
elsif($Record =~ /USMS (.*?)|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" FORMAT="(.*?)" TEXT="(.*?)"/)
{
    print LOGFILE "$1;$2;$3;$4;$5;$6\n";
}

But I always get:

;;;;;

Upvotes: 1

Views: 239

Answers (4)

Borodin

Reputation: 126722

It looks like all you want are the fields contained in double quotes.

Something like this will extract them:

use strict;
use warnings;

while (<DATA>) {
  # * rather than + so that an empty value ("") still counts as a field
  my @values = /"([^"]*)"/g;
  print join(';', @values), "\n";
}

__DATA__
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"

Output:

00966590832186;580;AAA;ascii;L2
00966590832186;580;BBB;1;ascii;L2
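The output above does not include the number that follows USMS, which the question also wants. A minimal sketch of one way to prepend it, assuming the number is always a run of digits directly before the pipe; fields_with_id is a hypothetical helper name used only for illustration:

```perl
use strict;
use warnings;

# fields_with_id is a hypothetical helper name used only for this sketch.
sub fields_with_id {
    my ($line) = @_;

    # The digits between "USMS " and the literal (escaped) pipe.
    my ($id) = $line =~ /^USMS (\d+)\|/;

    # Every double-quoted value; * (not +) so an empty "" still counts as a field.
    my @values = $line =~ /"([^"]*)"/g;

    # Fall back to an empty string if the line has no USMS prefix.
    return join ';', $id // '', @values;
}

print fields_with_id('USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"'), "\n";
# prints: 1362224754632;00966590832186;580;AAA;ascii;L2
```

This still prints every quoted value, including FORMAT, so it fits best when all attributes are wanted in the order they appear.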

Upvotes: 0

Schwern

Reputation: 164809

Instead of using a single regex, I would split the data into its separate sections first, then approach them separately.

my($usms_part, $request) = split / \s* \|<REQ \s* /x, $Record;

my($usms_id) = $usms_part =~ /^USMS (\d+)$/;

my %request;
while( $request =~ /(\w+)="(.*?)"/g ) {
    $request{$1} = $2;
}

Rather than having to hard code all the possible key/value pairs, and their possible orderings, you can parse them generically in one piece of code.
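Putting those pieces together, here is a rough sketch of how the semicolon-separated line from the question could be produced. The helper name parse_record and the choice of which keys to print are assumptions made for illustration; adjust the key list to whatever fields you actually need.

```perl
use strict;
use warnings;

# parse_record is a hypothetical helper name, not part of the answer above.
sub parse_record {
    my ($record) = @_;

    # Split once on the literal "|<REQ" that separates the two sections.
    my ($usms_part, $request) = split / \s* \|<REQ \s* /x, $record, 2;
    return unless defined $request;

    my ($usms_id) = $usms_part =~ /^USMS (\d+)$/;
    return unless defined $usms_id;

    # Collect every KEY="value" pair, whatever the order or number of keys.
    my %request;
    while ( $request =~ /(\w+)="(.*?)"/g ) {
        $request{$1} = $2;
    }

    # Pick the wanted fields by name; a missing key becomes an empty string.
    return join ';', $usms_id,
        map { $request{$_} // '' } qw(MSISDN CONTRACT SUBSCRIPTION TEXT);
}

print parse_record($_), "\n" for (
    'USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"',
    'USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"',
);
# prints:
# 1362224754632;00966590832186;580;AAA;L2
# 1362224754632;00966590832186;580;BBB;L2
```

Because the fields are looked up by name, the optional THRESHOLDID attribute needs no second pattern. Add it to the qw(...) list if it should appear in the output.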

Upvotes: 3

Birei

Reputation: 36262

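The pipe (|) is a special character in regular expressions: it means alternation, so your pattern can match USMS (.*?) on its own and leave every capture empty. Escape it as \| to match a literal pipe. Note also that in your data THRESHOLDID comes before FORMAT, so the pattern has to list them in that order:

if($Record =~ /USMS (.*?)\|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" THRESHOLDID="(.*?)" FORMAT="(.*?)" TEXT="(.*?)"/)

Escape the pipe the same way in the elsif branch.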
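To see why the unescaped pipe produces nothing but semicolons, here is a small sketch comparing the two forms on a shortened record. The shortened record and the variable names are made up for the demonstration.

```perl
use strict;
use warnings;

# A shortened record, used only for this demonstration.
my $record = 'USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580"';

# Unescaped: the pattern means 'USMS (.*?)' OR '<REQ MSISDN="(.*?)"'.
# The left alternative matches at the start with an empty lazy capture,
# and the group in the right alternative is never set.
my @unescaped = $record =~ /USMS (.*?)|<REQ MSISDN="(.*?)"/;

# Escaped: the pipe is a literal character, so both groups capture.
my @escaped = $record =~ /USMS (.*?)\|<REQ MSISDN="(.*?)"/;

print join(';', map { $_ // '' } @unescaped), "\n";   # prints: ;
print join(';', @escaped), "\n";                      # prints: 1362224754632;00966590832186
```

The match still succeeds in the unescaped case, which is why the if branch runs and prints empty fields instead of falling through.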

Upvotes: 3

Myforwik

Reputation: 3588

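Change each

(.*?)

to

([a-zA-Z0-9]*)

so that a capture cannot run past its closing quote. The | in the pattern still needs to be escaped as \| for the whole pattern to match.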

Upvotes: 1
