Reputation: 25
I'm querying a column in hive
.
This columns contains a json
payload and is in this format:
{"aaa":"xxx", "status":"yes", "bbb":"xyx", "ccc":"yxy", "status":"no", "status":"maybe", "ddd":"zyz"}
I'm trying to extract all the status values that appear in different points of the payload. I'm using the following:
select regexp_extract (payload,'(?<=status":")(.*?)"') as Status
This however only returns the first occurrence of status and is missing all the next. How do I get the values after every time Status is repeated?
Thanks
Upvotes: 2
Views: 8466
Reputation: 31648
Use a combination of REGEXP_REPLACE
, SPLIT,EXPLODE
and get_json_object()
SELECT *
FROM (
SELECT get_json_object(r, '$.status') AS STATUS
FROM yourtable t
LATERAL VIEW explode(split(regexp_replace(t.payload, ',', '},{'), ',')) t1 AS r
) a
WHERE STATUS IS NOT NULL;
This gives
status
yes
no
maybe
Upvotes: 1
Reputation: 5840
I guess youre doing this because the json is malformatted containing dulicate keys.
Regular expression using anything-but [^abc]
notation:
/"status":"([^"]*)"/g
That will also collect empty string from "status":"", if you want only entries containing a value use:
/"status":"([^"]+)"/g
Original and best best regex reference is of course perldoc perlre
:
https://perldoc.perl.org/perlre.html
Upvotes: 0