Reputation: 396
I'm having trouble parsing URLs in Postgres. I have a database full of customers and URLs associated with them. I need an array of the unique domains associated with each customer. I'd love to be able to do the parsing in my query instead of dumping my results to Python and parsing it there.
In the postgres docs I found this, but can't figure out how to incorporate it into my query:
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
alias | description | token
----------+---------------+------------------------------
protocol | Protocol head | http://
url | URL | example.com/stuff/index.html
host | Host | example.com
url_path | URL path | /stuff/index.html
(http://www.postgresql.org/docs/9.3/static/textsearch-parsers.html)
I'm starting with a table, like this:
customer_id | url
-------------+--------------------
000001 | www.example.com/fish
000001 | www.example.com/potato
000001 | www.potato.com/artichoke
000002 | www.otherexample.com
My code so far:
SELECT customer_id, array_agg(url)
FROM customer_url_table
GROUP BY customer_id
Which gives me:
customer_id | unique_domains
-----------------------------
000001 | {www.example.com/fish, www.example.com/potato, www.potato.com/artichoke}
000002 | {www.otherexample.com}
I want a table like this:
customer_id | unique_domains
-----------------------------
000001 | {example.com, potato.com}
000002 | {otherexample.com}
Working on a PostgreSQL 9.3.3 database that lives on AWS.
Upvotes: 4
Views: 10690
Reputation: 1301
You can create a function:
create or replace function fn_get_url_path(in_link text) returns text as
$$
begin
    return (select token
            from (select token, alias
                  from ts_parse('default', in_link)
                  natural join ts_token_type('default')) tokens
            where tokens.alias = 'url_path');
end;
$$ language plpgsql;
Then use the function:
select fn_get_url_path(column1)
from (values ('https://www.example.com'),
('https://www.example.com/test.html'),
('https://www.example.com/test?a=1'),
('this doesnt work and will return null')) a
To get the result:
/*
null
/test.html
/test?a=1
null
*/
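Note that fn_get_url_path pulls the url_path token, while the question asks for the domain. A sketch of the same approach adapted for that (untested here; fn_get_host is a hypothetical name, and the regexp_replace is a naive strip of a leading www. rather than proper public-suffix handling):

```sql
-- Same token-based approach, but keep the 'host' token instead
create or replace function fn_get_host(in_link text) returns text as
$$
begin
    return (select token
            from (select token, alias
                  from ts_parse('default', in_link)
                  natural join ts_token_type('default')) tokens
            where tokens.alias = 'host');
end;
$$ language plpgsql;

-- Aggregate unique domains per customer, stripping a leading www.
select customer_id,
       array_agg(distinct regexp_replace(fn_get_host(url), '^www\.', ''))
         as unique_domains
from customer_url_table
group by customer_id;
```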
Upvotes: 1
Reputation: 15316
The document you linked above is for use with a Postgres text search parser. That requires a separate configuration to set up, and may be more overhead and/or a different sort of thing than you are looking for.
If you do want to go that route, you can find more info on setting up a text search configuration here:
http://www.postgresql.org/docs/9.3/static/sql-createtsconfig.html
However, if you want to do the parsing inline in Postgres, I would recommend using a Postgres procedural language, where you can import parsing libraries for that language.
You mentioned Python, so you could use PL/Python and a URL parsing library such as urlparse (called urllib.parse in Python 3).
That includes this example code:
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
Going beyond that example, you can get the hostname with the hostname member:
>>> print o.hostname
www.cwi.nl
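Putting those pieces together outside the database for a moment, here is a minimal Python 3 sketch (urllib.parse; sample rows hard-coded from the question) of the hostname extraction and per-customer dedup that the PL/Python function body would perform:

```python
from urllib.parse import urlparse

def extract_domain(url):
    """Return the hostname minus a leading 'www.' (a naive strip --
    see the linked answer for proper public-suffix handling)."""
    # urlparse only detects a netloc after '//', so prepend it for
    # bare hosts like 'www.example.com/fish'
    if '//' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host

# Sample rows from the question's customer_url_table
rows = [
    ('000001', 'www.example.com/fish'),
    ('000001', 'www.example.com/potato'),
    ('000001', 'www.potato.com/artichoke'),
    ('000002', 'www.otherexample.com'),
]

# Collect the unique domains per customer
unique_domains = {}
for customer_id, url in rows:
    unique_domains.setdefault(customer_id, set()).add(extract_domain(url))

for customer_id in sorted(unique_domains):
    print(customer_id, sorted(unique_domains[customer_id]))
# 000001 ['example.com', 'potato.com']
# 000002 ['otherexample.com']
```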
If you want to properly parse out just the domain name (minus the www and any other assorted parts that may be there), there are lots of edge cases and variants, so an approach such as the one in this answer would be best.
For more information about setting up PL/Python, you can go here:
http://www.postgresql.org/docs/9.3/static/plpython.html
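Wrapped up as a database function, it could look something like this (a hypothetical sketch, not tested on 9.3; py_get_hostname is my own name, and it assumes the plpythonu extension is installed):

```sql
-- Requires: CREATE EXTENSION plpythonu;
create or replace function py_get_hostname(url text) returns text as
$$
    from urlparse import urlparse   # urllib.parse under PL/Python3
    # urlparse only detects a netloc after '//', so prepend it for
    # bare hosts like 'www.example.com/fish'
    if '//' not in url:
        url = '//' + url
    return urlparse(url).hostname
$$ language plpythonu;

-- Then the original query becomes:
select customer_id,
       array_agg(distinct py_get_hostname(url)) as unique_domains
from customer_url_table
group by customer_id;
```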
So, that's how you could do the parsing in Postgres instead of "dumping my results to Python and parsing it there".
It ends up coming a bit full circle with PL/Python, but if you really want to do the parsing within SQL (say, for performance across a large data set), PL/Python may be worth the extra effort.
Upvotes: 4