pavan

Reputation: 314

How can I extract the values from this key-value pair?

I have a line of key=value pairs separated by ',', as shown below. I need to extract only the values, whether or not a value is present for a given key.

Category=, userAgent=Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko, referringURL=https://www.localhost.com/account/pay?link=credit_card, criteria=InFormCriteria(cc='MZ',tend=123,cd='parts')

I used the following code:

while(<FH>){
    while($_=~m/([^=]+)=([^\s]+,?)/g){
        print $2." ";
    }
    print "\n";
}

and I get the following output:

, Mozilla/5.0 https://www.localhost.com/account/pay?link=credit_card, InFormCriteria(cc='MZ',tend=123,cd='parts')

However, I need to get:

""@@Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko@@https://www.localhost.com/account/pay?link=credit_card@@InFormCriteria(cc='MZ',tend=123,cd='parts')

What am I doing wrong?

Upvotes: 2

Views: 121

Answers (2)

n0741337

Reputation: 2514

Your actual delimiter looks like ", " (a comma followed by a space) to me. Provided that none of the values in the key=value pairs contain that delimiter, you could use gawk:

gawk '{sub(/^\w+=/, ""); gsub( /, \w+=/, "@@"); print}'

This drops the key= part of the first field, then replaces each following delimiter-plus-key with @@. Your sample data comes out like this for me:

@@Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko@@https://www.localhost.com/account/pay?link=credit_card@@InFormCriteria(cc='MZ',tend=123,cd='parts')

If you really need to have empty values denoted as "", you could use regular gawk/awk in a script like:

#!/usr/bin/awk -f

BEGIN {FS=", "; OFS="@@"}

{
    for(i=1; i<=NF; i++) {
        val = substr( $i, index( $i, "=" )+1 )
        if( val=="" ) val="\"\""
        printf "%s%s", val, (i<NF?OFS:"\n")
    }
}

Alternatively, you could just sub or gsub those fields to "" as well. That script outputs the following for me:

""@@Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko@@https://www.localhost.com/account/pay?link=credit_card@@InFormCriteria(cc='MZ',tend=123,cd='parts')

Both solutions assume that each field is in the form key=value and that no value contains ", " (a comma followed by a space). If the latter isn't true, you might want to change the logging delimiter (if you can) to something more distinct. Or, if you can identify distinct cases where ", " appears inside a value (say, between balanced parens), you could alter those before parsing the primary key=value pairs.
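If commas can show up inside balanced parens, one option is to split only on commas at paren depth zero. Below is a minimal Perl sketch of that idea, not a tested drop-in: `split_top_level` and `values_only` are hypothetical helper names, the sample line is made up, and it assumes parens are balanced and that no value contains a comma outside parens.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a line on commas that are NOT inside parentheses.
sub split_top_level {
    my ($line) = @_;
    my @fields;
    my $depth   = 0;
    my $current = '';
    for my $char ( split //, $line ) {
        if    ( $char eq '(' ) { $depth++ }
        elsif ( $char eq ')' ) { $depth-- if $depth > 0 }
        if ( $char eq ',' && $depth == 0 ) {
            push @fields, $current;    # a top-level comma ends the field
            $current = '';
            next;
        }
        $current .= $char;
    }
    push @fields, $current;
    return @fields;
}

# Hypothetical helper: keep the text after the first '=' of each field,
# printing "" for empty values as the question asks.
sub values_only {
    my ($line) = @_;
    my @values;
    for my $field ( split_top_level($line) ) {
        $field =~ s/^\s+//;
        my ( undef, $value ) = split /=/, $field, 2;
        push @values, ( defined $value && length $value ) ? $value : '""';
    }
    return @values;
}

# Made-up sample with ", " inside parens.
my $sample = q{Category=, ua=Mozilla (X11; Linux, x86_64), criteria=F(cc='MZ', tend=123)};
print join( '@@', values_only($sample) ), "\n";
# prints: ""@@Mozilla (X11; Linux, x86_64)@@F(cc='MZ', tend=123)
```

Because the split limit is 2, any later '=' (such as link= in the URL) stays part of the value.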

Upvotes: 1

Sobrique

Reputation: 53498

It's more annoying than it sounds, since your string uses inconsistent delimiters. It'll be difficult to parse with an RE as a result, and will always be unreliable.

Modules exist to do this - as mentioned by Wintermute, HTTP::BrowserDetect is built for parsing this particular sort of string.

If you're really set on doing it the hard way - the 'simple' split_on_delimiter approach won't work, thanks to having nested elements in brackets. So I'd suggest - pick out the keys with a regex (because they're always a word, followed by =).

Then build a small regex for each key that captures everything from that key up to the next key (or the end of the string).

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $string =
    q{Category=, userAgent=Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko, referringURL=https://www.localhost.com/account/pay?link=credit_card, criteria=InFormCriteria(cc='MZ',tend=123,cd='parts')};

my @keys = ( $string =~ m/(?:^|\s)(\w+)=/g );
my %parsed_thing;

for my $index ( 0 .. $#keys ) {
    my $regex =
          $keys[$index]
        . '=(.*?)[, ]*'
        . ( defined $keys[ $index + 1 ] ? $keys[ $index + 1 ] : '$' );
    print "Using a RE of: ", $regex, "\n";

    my ($value) = ( $string =~ m/$regex/ );
    print "\tGot: $keys[$index] => $value\n";
    $parsed_thing{ $keys[$index] } = $value;
}

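# Hash order is not guaranteed, so this may print the values in any order:
print join( '@@', values %parsed_thing ),"\n";
# or use a hash slice to keep the original key order:
print join( '@@', @parsed_thing{@keys} ),"\n";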
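The same idea (a value runs until the next "word=" key) can probably be folded into a single pattern with a lookahead. This is only a sketch: `parse_pairs` is a hypothetical name, and it assumes keys are always preceded by a comma plus whitespace, which holds for the sample line but would break if a value contained something like ", tend=".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [key, value] pairs in their original order.
sub parse_pairs {
    my ($line) = @_;
    my @pairs;
    # A key sits at the start of the line or after ", ". Its value is taken
    # lazily up to the next ", key=" or the end of the line.
    while ( $line =~ /(?:\A|,\s+)(\w+)=(.*?)(?=,\s+\w+=|\s*\z)/g ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

my $string =
    q{Category=, userAgent=Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko, referringURL=https://www.localhost.com/account/pay?link=credit_card, criteria=InFormCriteria(cc='MZ',tend=123,cd='parts')};

# Show empty values as "" and join everything with @@.
print join( '@@', map { length( $_->[1] ) ? $_->[1] : '""' } parse_pairs($string) ), "\n";
```

For the sample line this should give the output requested in the question, since the commas inside InFormCriteria(...) are not followed by whitespace and so never satisfy the lookahead.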

Upvotes: 1
