Valerie
Valerie

Reputation: 25

regexp_extract get all matches

I'm querying a column in hive. This columns contains a json payload and is in this format:

{"aaa":"xxx", "status":"yes", "bbb":"xyx", "ccc":"yxy", "status":"no", "status":"maybe", "ddd":"zyz"}

I'm trying to extract all the status values that appear in different points of the payload. I'm using the following:

select regexp_extract (payload,'(?<=status":")(.*?)"') as Status

This however only returns the first occurrence of status and is missing all the next. How do I get the values after every time Status is repeated?

Thanks

Upvotes: 2

Views: 8466

Answers (2)

Kaushik Nayak
Kaushik Nayak

Reputation: 31648

Use a combination of REGEXP_REPLACE, SPLIT,EXPLODE and get_json_object()

SELECT *
FROM (
    SELECT get_json_object(r, '$.status') AS STATUS
    FROM yourtable t 
     LATERAL VIEW explode(split(regexp_replace(t.payload, ',', '},{'), ',')) t1 AS r
    ) a
WHERE STATUS IS NOT NULL;

This gives

status
yes
no
maybe

Upvotes: 1

ekerner
ekerner

Reputation: 5840

I guess youre doing this because the json is malformatted containing dulicate keys.

Regular expression using anything-but [^abc] notation:

/"status":"([^"]*)"/g

That will also collect empty string from "status":"", if you want only entries containing a value use:

/"status":"([^"]+)"/g

Original and best best regex reference is of course perldoc perlre : https://perldoc.perl.org/perlre.html

Upvotes: 0

Related Questions