DW2

Reputation: 11

How to use match in AWK for a given set of data

I'm processing the output of a curl command in AWK, but despite reading about match and regexps I'm still having trouble. Everything works, but in a really hackish way that relies on a lot of substr calls and very basic match usage, without capturing anything with a regexp.

My real data is a bit more complicated, but here's a simplified version. Assume the following is stored in a string, str:

[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}][{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]

Some things to note about this data:

There are 3 "sets" of data delimited by {} in the first brackets [] and 2 sets in the second brackets. The string always has at least one set of brackets, and at least one set of data in each set of brackets (i.e. it will never be the empty string and will always have SOME valid data in it).

Brackets are also used for the DataC data, so those need to be considered in some way

No punctuation will ever appear in the string aside from delimiters -- all actual data is alphanumeric

The fields DataA, DataBee, and DataC will always have those names

The data for DataC will always be exactly 5 numbers, separated by commas

What I'd like to do is write a loop that will go through the string and pull out the values -- a = whatever DataA is (200 in the first case), b = whatever DataBee is (63500 in the first case), and c[1] through c[5] containing the values from DataC.

I feel like if I could just get ideas about how to do this for the above data I could run with it to adapt it to my needs. As of right now the loop I have for this using substr is like 30 lines long :(

Upvotes: 1

Views: 93

Answers (2)

Thor

Reputation: 47099

I would recommend using jq, e.g.:

jq -c '.[]' <<<"$str"

Output:

{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}

To extract DataC:

jq -c '.[] | .DataC' <<<"$str"

Output:

[3,22,64,838,2]
[55,22,64,838,2]
[3,22,64,838,2]
[3,22,64,838,2]
[3,22,64,838,2]

Upvotes: 1

Corentin Limier

Reputation: 5006

For fun, using awk:

I use regex values for FS and RS to split the JSON. This way, each record holds a single key (DataA, DataBee or DataC) and each field holds at most one value. Note that a multi-character RS is treated as a regex by gawk and mawk; POSIX leaves that case unspecified.

To see what FS and RS do, look at the output of this command:

awk -F",|\":\"|:\\\[" '
    {$1=$1}1
' OFS="\t" RS="\",\"|},{|\\\]" file

(you can replace file with <(curl <your_url>) or <(echo <your_json_str>))

Returns:

[{"DataA        200                           
DataBee 63500                                 
DataC"  3       22      64      838     2     

"DataA  190                                   
DataBee 63100                                 
DataC"  55      22      64      838     2     

"DataA  200                                   
DataBee 63500                                 
DataC"  3       22      64      838     2     
}                                             
[{"DataA        200                           
DataBee 63500                                 
DataC"  3       22      64      838     2     

"DataA  200                                   
DataBee 63500                                 
DataC"  3       22      64      838     2     
}                                     

Now the data is in a shape that awk can work with:

awk -F",|\":\"|:\\\[" '
    /DataA/{a=$2}
    /DataBee/{b=$2}
    /DataC/{for(i=2;i<=NF;i++){c[i-1]=$i}}
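    # once all three have been seen, print them and reset for the next object
    a!=""&&b!=""&&c[1]!=""{
        print "a: ", a
        print "b: ", b
        printf "c: "
        # DataC always holds exactly 5 values; "for (i in c)" has no guaranteed order
        for(i=1;i<=5;i++){
            printf "%s, ", c[i]
        }
        print ""
        a=""; b=""; c[1]=""
    }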
' RS="\",\"|},{|\\\]" file

This command stores the values in variables and prints them once a, b and c are all set.

Returns:

a:  200
b:  63500
c: 3, 22, 64, 838, 2,
a:  190
b:  63100
c: 55, 22, 64, 838, 2,
a:  200
b:  63500
c: 3, 22, 64, 838, 2,
a:  200
b:  63500
c: 3, 22, 64, 838, 2,
a:  200
b:  63500
c: 3, 22, 64, 838, 2,
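
To make the record and field splitting above easier to follow, here is a rough Python sketch of the same idea. It is only an illustration, not a replacement for the awk command: the two patterns are the RS and FS regexes as awk sees them after shell unescaping, Python's re module stands in for awk's regex engine, and the names extract and sample are made up for this example.

```python
import re

# The awk RS and FS regexes, written as Python patterns (an approximation:
# this mimics the splitting only, it does not run awk).
RS = re.compile(r'","|\},\{|\]')   # record separators: ","  },{  ]
FS = re.compile(r',|":"|:\[')      # field separators:  ,  ":"  :[

def extract(text):
    """Yield (a, b, c) per object; c maps 1..5 to the DataC values (as strings)."""
    a = b = None
    for record in RS.split(text):
        fields = FS.split(record)
        if "DataA" in fields[0]:
            a = fields[1]
        elif "DataBee" in fields[0]:
            b = fields[1]
        elif "DataC" in fields[0]:
            # DataC is the last key of each object, so emit and reset here
            c = {i: v for i, v in enumerate(fields[1:], start=1)}
            yield a, b, c
            a = b = None

# Shortened version of the question's data: two arrays back to back.
sample = ('[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},'
          '{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}]'
          '[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]')

for a, b, c in extract(sample):
    print(a, b, c)
```

Like the awk version, this relies on the keys always arriving in the order DataA, DataBee, DataC and on the data containing no punctuation.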

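For fun, using awk's match and this excellent answer (the three-argument form of match, which fills an array with the capture groups, is a gawk extension):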

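awk '
# gawk only: the third argument to match() receives the capture groups.
# a and i are declared as extra parameters so that they stay local.
function find_all(str, patt,    a, i) {
    while (match(str, patt, a) > 0) {
        for (i=1; i in a; i++) print a[i]
        str = substr(str, RSTART+RLENGTH)
    }
}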
{
    print "Catching DataA"
    find_all($0, "DataA\":\"([0-9]*)")
    print "Catching DataBee"
    find_all($0, "DataBee\":\"([0-9]*)")
    print "Catching DataC"
    find_all($0, "DataC\":.([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)")
}
' file

Returns:

Catching DataA
200
190
200
200
200
Catching DataBee
63500
63100
63500
63500
63500
Catching DataC
3
22
64
838
2
55
22
64
838
2
3
22
64
838
2
3
22
64
838
2
3
22
64
838
2
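
The capture-group idea can also be applied once per object instead of once per field, which keeps each DataA, DataBee and DataC together as the question asks. The following is a hedged Python sketch rather than awk; the pattern OBJ and the function find_objects are hypothetical names, and it assumes the keys always appear in the order shown with no whitespace between tokens.

```python
import re

# One pattern per object. Assumes the fixed key order DataA, DataBee, DataC
# and purely alphanumeric values, as stated in the question.
OBJ = re.compile(
    r'"DataA":"([A-Za-z0-9]*)",'
    r'"DataBee":"([A-Za-z0-9]*)",'
    r'"DataC":\[(\d+),(\d+),(\d+),(\d+),(\d+)\]'
)

def find_objects(text):
    """Return a list of (a, b, c) tuples; c maps 1..5 to the DataC integers."""
    found = []
    for m in OBJ.finditer(text):
        a, b = m.group(1), m.group(2)
        # groups 3..7 are the five DataC numbers
        c = {i: int(m.group(i + 2)) for i in range(1, 6)}
        found.append((a, b, c))
    return found

sample = ('[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},'
          '{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}]'
          '[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]')

for a, b, c in find_objects(sample):
    print(a, b, c[1], c[5])
```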

Now that you've seen how ugly that is, see how easy it can be using Python:

import json

data_str = '[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}][{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]'

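decoder = json.JSONDecoder()

while data_str:
    # raw_decode parses one JSON value and returns it with the index where it ended
    data, index = decoder.raw_decode(data_str)
    for element in data:
        print("DataA: ", element["DataA"])
        print("DataBee: ", element["DataBee"])
        print("DataC: ", element["DataC"])
    # drop the parsed array, plus any whitespace (e.g. a trailing newline) before the next one
    data_str = data_str[index:].lstrip()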

Returns:

DataA:  200
DataBee:  63500
DataC:  [3, 22, 64, 838, 2]
DataA:  190
DataBee:  63100
DataC:  [55, 22, 64, 838, 2]
DataA:  200
DataBee:  63500
DataC:  [3, 22, 64, 838, 2]
DataA:  200
DataBee:  63500
DataC:  [3, 22, 64, 838, 2]
DataA:  200
DataBee:  63500
DataC:  [3, 22, 64, 838, 2]

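This solution is not only cleaner, it is also more robust: a real JSON parser copes with extra whitespace, reordered keys, or values containing punctuation, all of which would break the regex-based approaches.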
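
As a small sketch of what that robustness looks like in practice, the variant below fills a, b and c[1] through c[5] as asked in the question, and still works when the input has reordered keys, extra spaces, or newlines between the arrays. The names iter_objects and messy are invented for this example. One detail worth knowing is that raw_decode does not skip leading whitespace by itself, so the loop steps over it explicitly.

```python
import json

def iter_objects(text):
    """Yield every object from a string of back-to-back JSON arrays."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            # raw_decode raises on leading whitespace, so skip it by hand
            pos += 1
            continue
        # parse one array starting at pos; pos becomes the index just past it
        array, pos = decoder.raw_decode(text, pos)
        yield from array

# Hypothetical "unexpected formatting": reordered keys, spaces, newlines.
messy = ('[{"DataC": [3, 22, 64, 838, 2], "DataA": "200", "DataBee": "63500"}]\n'
         ' [{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}]\n')

for obj in iter_objects(messy):
    a = obj["DataA"]
    b = obj["DataBee"]
    c = dict(enumerate(obj["DataC"], start=1))  # 1-based, like an awk array
    print(a, b, c)
```

Passing the position to raw_decode avoids re-slicing the string after every array, which matters little here but keeps the loop linear for large inputs.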

Upvotes: 4
