Reputation: 21
I'm trying to run a hive query that will produce a table with domain, key, value and count, grouped by the unique combination of domain/key/value.
Example of the data:
http://www.aaa.com/path?key_a=5&key_b=hello&key_c=today&key_d=blue
http://www.aaa.com/path?key_a=5&key_b=goodb&key_c=yestr&key_d=blue
http://www.bbb.com/path?key_a=5&key_b=hello&key_c=today&key_d=blue
http://www.bbb.com/path?key_a=5&key_b=goodb&key_c=ystrd
Desired output:
aaa.com | key_a | 5 | 2
aaa.com | key_b | hello | 1
aaa.com | key_b | goodb | 1
aaa.com | key_c | today | 1
aaa.com | key_c | yestr | 1
aaa.com | key_d | blue | 2
bbb.com | key_a | 5 | 2
bbb.com | key_b | hello | 1
bbb.com | key_b | goodb | 1
bbb.com | key_c | today | 1
bbb.com | key_c | ystrd | 1
bbb.com | key_d | blue | 1
Here's what I've been using:
"select parse_url(url,'HOST'), str_to_map(parse_url(url,'QUERY'),'&','='), count(1) from url_table group by select parse_url(url,'HOST'), str_to_map(parse_url(url,'QUERY'),'&','=') limit 10;"
Where am I going wrong? Specifically where I think I'm messing up is: str_to_map(parse_url(url,'QUERY'),'&','=') because I don't know how to break apart the query string into multiple key-value pairs and then group correctly.
Upvotes: 2
Views: 16444
Reputation: 121
Parse URL Tuples
Something like this will resolve your queries
SELECT
count(*), host, path, query
FROM (
SELECT b.*
FROM src
LATERAL VIEW parse_url_tuple(completeurl, 'HOST',
'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id
)
GROUP BY host, path, query ;
Refer here for more details about https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-parse_url_tuple
Upvotes: 0
Reputation: 34184
You could achieve this with the help of Lateral View and explode.
This should work :
hive> select parse_url(url,'HOST') as host, v.key as key, v.val,
count(*) as count from url u LATERAL VIEW
explode(str_to_map(parse_url(url,'QUERY'),'&','=')) v as key, val
group by parse_url(url, 'HOST'), v.key, v.val;
Upvotes: 2
Reputation: 1
I have verified the query below should work:
SELECT
parse_url(url, 'HOST') AS host,
q.key AS key,
q.val AS val,
COUNT(*)
FROM <your_table_with_url_as_a_field>
LATERAL VIEW explode(str_to_map(parse_url(url,'QUERY'),'&','=')) q AS key, val
WHERE parse_url(url,'QUERY') IS NOT NULL
GROUP BY parse_url(url, 'HOST'), q.key, q.val
ORDER BY host, key, val;
Upvotes: 0