Idr
Idr

Reputation: 6250

Decode Raw URL in Hadoop Hive, prefer non-Java solutions

I am querying against a hive table that has a field out_url that has raw url encoding, like:

http%3A%2F%2Fwww.example.com%2Findex.php%3Fpage%3D260%26id%3D22

I want to extract only the domain, and that would be possible with `parse_url(out_url, 'HOST') if the url was not raw-encoded.

To get around this, I'm doing this ugly double regexp replacement like:

parse_url(regexp_replace(regexp_replace(out_url, '%3A', ':'), '%2F', '/'), 'HOST')

that converts the %3A to : and %2F to / and then extracts the domain. I understand that I could write a Java UDF to do this, but that's not a great option for me as I currently mostly suck at writing Java.

Ideas? Is it possible to write a Python UDF?

Upvotes: 3

Views: 6022

Answers (1)

benjguin
benjguin

Reputation: 1516

there seems to be an easy way here which uses Java reflection: http://mail-archives.apache.org/mod_mbox/hive-user/201109.mbox/%3C15C962F3417BF94ABEAB2314AF92A16A1FF9CE@SVR-PR-MB2.cb.careerbuilder.com%3E

SELECT reflect("java.net.URLDecoder", "decode", field_name) FROM table;

Upvotes: 12

Related Questions