user2839682
user2839682

Reputation: 21

How to parse url query string in Hive to multiple key-value pairs

I'm trying to run a hive query that will produce a table with domain, key, value and count, grouped by the unique combination of domain/key/value.

Example of the data:

http://www.aaa.com/path?key_a=5&key_b=hello&key_c=today&key_d=blue
http://www.aaa.com/path?key_a=5&key_b=goodb&key_c=yestr&key_d=blue
http://www.bbb.com/path?key_a=5&key_b=hello&key_c=today&key_d=blue
http://www.bbb.com/path?key_a=5&key_b=goodb&key_c=ystrd

Desired output:

aaa.com | key_a | 5 | 2
aaa.com | key_b | hello | 1
aaa.com | key_b | goodb | 1
aaa.com | key_c | today | 1
aaa.com | key_c | yestr | 1
aaa.com | key_d | blue | 2
bbb.com | key_a | 5 | 2
bbb.com | key_b | hello | 1
bbb.com | key_b | goodb | 1
bbb.com | key_c | today | 1
bbb.com | key_c | ystrd | 1
bbb.com | key_d | blue | 1

Here's what I've been using:

"select parse_url(url,'HOST'), str_to_map(parse_url(url,'QUERY'),'&','='), count(1) from url_table group by select parse_url(url,'HOST'), str_to_map(parse_url(url,'QUERY'),'&','=') limit 10;"

Where am I going wrong? Specifically where I think I'm messing up is: str_to_map(parse_url(url,'QUERY'),'&','=') because I don't know how to break apart the query string into multiple key-value pairs and then group correctly.

Upvotes: 2

Views: 16444

Answers (3)

user1388547
user1388547

Reputation: 121

Parse URL Tuples

  1. src table contains completeurl
  2. Then apply group by host, path, query

Something like this will resolve your queries

SELECT 
  count(*), host, path, query
FROM ( 
  SELECT b.*
  FROM src 
  LATERAL VIEW parse_url_tuple(completeurl, 'HOST',
           'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id
     )
GROUP BY host, path, query ;

Refer here for more details about https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-parse_url_tuple

Upvotes: 0

Tariq
Tariq

Reputation: 34184

You could achieve this with the help of Lateral View and explode.

This should work :

hive> select parse_url(url,'HOST') as host, v.key as key, v.val,
count(*) as count from url u LATERAL VIEW
explode(str_to_map(parse_url(url,'QUERY'),'&','=')) v as key, val
group by parse_url(url, 'HOST'), v.key, v.val;

Upvotes: 2

Yong Sheng
Yong Sheng

Reputation: 1

I have verified the query below should work:

SELECT
  parse_url(url, 'HOST') AS host,
  q.key AS key,
  q.val AS val,
  COUNT(*)
FROM <your_table_with_url_as_a_field>
LATERAL VIEW explode(str_to_map(parse_url(url,'QUERY'),'&','=')) q AS key, val
WHERE parse_url(url,'QUERY') IS NOT NULL
GROUP BY parse_url(url, 'HOST'), q.key, q.val
ORDER BY host, key, val;

Upvotes: 0

Related Questions