Reputation: 11
I'm working on processing some results obtained from a curl command in AWK but despite reading about match and regexps I'm still having some issues. I've got everything written, but in a really hackish way that's using a lot of substr and really basic match usage without capturing anything with a regexp.
My real data is a bit more complicated, but here's a simplified version. Assume the following is stored in a string, str:
[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}][{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]
Some things to note about this data:
There are 3 "sets" of data delimited by {} in the first pair of brackets [] and 2 sets in the second pair. The string always contains at least one pair of brackets, and each pair always contains at least one set (i.e. it will never be the empty string and will always have SOME valid data in it)
Brackets are also used for the DataC data, so those need to be considered in some way
No punctuation will ever appear in the string aside from delimiters -- all actual data is alphanumeric
The fields DataA, DataBee, and DataC will always have those names
The data for DataC will always be exactly 5 numbers, separated by commas
What I'd like to do is write a loop that will go through the string and pull out the values -- a = whatever DataA is (200 in the first case), b = whatever DataBee is (63500 in the first case), and c[1] through c[5] containing the values from DataC.
I feel like if I could just get ideas about how to do this for the above data I could run with it to adapt it to my needs. As of right now the loop I have for this using substr is like 30 lines long :(
Upvotes: 1
Views: 93
Reputation: 47099
I would recommend using jq, e.g.:
jq -c '.[]' <<<"$str"
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
To extract DataC:
jq -c '.[] | .DataC' <<<"$str"
Output:
[3,22,64,838,2]
[55,22,64,838,2]
[3,22,64,838,2]
[3,22,64,838,2]
[3,22,64,838,2]
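If the jq output is then handed to a script, each line is a standalone JSON object, so it can be parsed line by line. The following is only a minimal sketch in Python: the names jq_output and parse_records are made up for illustration, and the sample text is simply the first two lines of the output shown above.

```python
import json

# Assumed sample: the first two lines printed by `jq -c '.[]'` above.
jq_output = """\
{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}
{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}
"""


def parse_records(text):
    """Return one (a, b, c) tuple per non-empty line of jq -c output."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        obj = json.loads(line)  # each line is one complete JSON object
        records.append((obj["DataA"], obj["DataBee"], obj["DataC"]))
    return records


for a, b, c in parse_records(jq_output):
    print(a, b, c)
```

Here a and b stay strings (they are quoted in the JSON) and c is a list of five integers.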
Upvotes: 1
Reputation: 5006
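For fun, using awk:
I use regular-expression FS and RS values to split the JSON, so that each record (line) holds a single field (DataA, DataBee or DataC) and each column holds at most one value. Note that a multi-character RS is treated as a regular expression by gawk and mawk, but not by every awk.
To understand how FS and RS are used, see how this command works: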
awk -F",|\":\"|:\\\[" '
{$1=$1}1
' OFS="\t" RS="\",\"|},{|\\\]" file
(you can replace file with <(curl <your_url>) or <(echo <your_json_str>))
Returns :
[{"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
"DataA 190
DataBee 63100
DataC" 55 22 64 838 2
"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
}
[{"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
"DataA 200
DataBee 63500
DataC" 3 22 64 838 2
}
Now it looks like something I can use with awk :
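awk -F",|\":\"|:\\\[" '
  /DataA/   { a = $2 }
  /DataBee/ { b = $2 }
  /DataC/   { for (i = 2; i <= NF; i++) c[i-1] = $i }
  # Print once all three fields of a set have been seen
  a != "" && b != "" && c[1] != "" {
    print "a:", a
    print "b:", b
    printf "c: "
    # Loop by index: the order of "for (i in c)" is unspecified
    for (i = 1; i <= 5; i++) printf "%s, ", c[i]
    print ""
    # Reset so the next set starts from scratch
    a = ""; b = ""; c[1] = ""
  }
' RS="\",\"|},{|\\\]" file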
This command stores the values in variables and prints them once a, b and c are all set.
Returns :
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 190
b: 63100
c: 55, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
a: 200
b: 63500
c: 3, 22, 64, 838, 2,
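For fun, using awk's match and this excellent answer (the three-argument form of match, which fills an array with the capture groups, is a GNU awk extension):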
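awk '
# Print every capture group of every match of patt in str (gawk only).
# a and i are listed as extra parameters to keep them local.
function find_all(str, patt,    a, i) {
  while (match(str, patt, a) > 0) {
    for (i = 1; i in a; i++) print a[i]
    # Continue searching after the end of the current match
    str = substr(str, RSTART + RLENGTH)
  }
}
{
  print "Catching DataA"
  find_all($0, "DataA\":\"([0-9]*)")
  print "Catching DataBee"
  find_all($0, "DataBee\":\"([0-9]*)")
  print "Catching DataC"
  find_all($0, "DataC\":.([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)")
}
' file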
Returns
Catching DataA
200
190
200
200
200
Catching DataBee
63500
63100
63500
63500
63500
Catching DataC
3
22
64
838
2
55
22
64
838
2
3
22
64
838
2
3
22
64
838
2
3
22
64
838
2
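For comparison, the same capture-group idea can be sketched in Python with one pattern per set instead of one pass per field. This is only an illustration: RECORD, find_records and the shortened sample string are hypothetical names and data, and the pattern assumes the fields always appear in the order DataA, DataBee, DataC with no whitespace, as in the question's sample.

```python
import re

# Assumed sample: the first two sets from the question's string.
sample = ('[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},'
          '{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}]')

# Groups 1-2 capture DataA and DataBee, groups 3-7 the five DataC numbers.
RECORD = re.compile(
    r'"DataA":"(\w*)","DataBee":"(\w*)",'
    r'"DataC":\[(\d+),(\d+),(\d+),(\d+),(\d+)\]'
)


def find_records(text):
    """Return a list of (a, b, c) tuples, one per matched set."""
    records = []
    for m in RECORD.finditer(text):
        a, b = m.group(1), m.group(2)
        c = [int(x) for x in m.groups()[2:]]  # the five DataC values
        records.append((a, b, c))
    return records


for a, b, c in find_records(sample):
    print("a:", a, "b:", b, "c:", c)
```

Like the awk version, this breaks as soon as the field order or spacing changes.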
Now that you've seen how ugly the regex approach gets, see how easy it can be using Python with a real JSON parser:
import json

data_str = '[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}][{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]},{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}]'
decoder = json.JSONDecoder()

# The string holds several JSON arrays back to back, so decode them one at a time
while data_str:
    # raw_decode returns the parsed value and the index where it ended
    data, index = decoder.raw_decode(data_str)
    for element in data:
        print("DataA:", element["DataA"])
        print("DataBee:", element["DataBee"])
        print("DataC:", element["DataC"])
    # Drop the array that was just decoded and continue with the rest
    data_str = data_str[index:]
Returns :
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 190
DataBee: 63100
DataC: [55, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
DataA: 200
DataBee: 63500
DataC: [3, 22, 64, 838, 2]
This solution is not only cleaner, it is more robust if you have unexpected results or unexpected formatting.
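To get closer to what the question asks for (a, b and c per set), the same raw_decode loop can be wrapped in a generator. This is a rough sketch, not a definitive implementation: iter_records and the sample string are hypothetical, and the whitespace skipping is an added assumption in case the real curl output separates the arrays with spaces or newlines.

```python
import json


def iter_records(text):
    """Yield (a, b, c) for every set in a string of back-to-back JSON arrays."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not accept leading whitespace, so skip it by hand
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Decode one array starting at pos; pos becomes the index just past it
        array, pos = decoder.raw_decode(text, pos)
        for element in array:
            yield element["DataA"], element["DataBee"], element["DataC"]


# Assumed sample: two arrays separated by whitespace.
sample = ('[{"DataA":"200","DataBee":"63500","DataC":[3,22,64,838,2]}] \n'
          '[{"DataA":"190","DataBee":"63100","DataC":[55,22,64,838,2]}]')

for a, b, c in iter_records(sample):
    print(a, b, c)
```

Passing the start index to raw_decode avoids re-slicing the string on every iteration, and malformed input raises json.JSONDecodeError instead of silently producing wrong values.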
Upvotes: 4