davidshere

Reputation: 396

Parsing URLs in Postgres

I'm having trouble parsing URLs in Postgres. I have a database full of customers and the URLs associated with them. I need an array of the unique domains associated with each customer. I'd love to be able to do the parsing in my query instead of dumping my results to Python and parsing it there.

In the postgres docs I found this, but can't figure out how to incorporate it into my query:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');  

  alias   |  description  |            token               
----------+---------------+------------------------------  
 protocol | Protocol head | http://  
 url      | URL           | example.com/stuff/index.html  
 host     | Host          | example.com  
 url_path | URL path      | /stuff/index.html  

(http://www.postgresql.org/docs/9.3/static/textsearch-parsers.html)

I'm starting with a table, like this:

customer_id | url 
-------------+--------------------   
000001      | www.example.com/fish  
000001      | www.example.com/potato  
000001      | www.potato.com/artichoke
000002      | www.otherexample.com

My code so far:

SELECT customer_id, array_agg(url)
FROM customer_url_table
GROUP BY customer_id

Which gives me:

customer_id | unique_domains
-----------------------------
000001      | {www.example.com/fish, www.example.com/potato, www.potato.com/artichoke}
000002      | {www.otherexample.com}

I want a table like this:

customer_id | unique_domains
-----------------------------
000001      | {example.com, potato.com}
000002      | {otherexample.com}

Working on a PostgreSQL 9.3.3 database that lives on AWS.

Upvotes: 4

Views: 10690

Answers (2)

Pål Thingbø

Reputation: 1301

You can create a function

create or replace function fn_get_url_path(in_link text) returns text as
$$
begin
  return (select token
          from (select token, alias
                from ts_parse('default', in_link)
                       natural join ts_token_type('default')) tokens
          where tokens.alias = 'url_path');

end;
$$ language PLpgSQL;

Then use the function:

select fn_get_url_path(column1)
from (values ('https://www.example.com'),
         ('https://www.example.com/test.html'),
         ('https://www.example.com/test?a=1'),
         ('this doesnt work and will return null')) a

To get the result:

/*
  null
  /test.html
  /test?a=1
  null
*/
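
Since the question asks for domains rather than paths, the same pattern can presumably be pointed at the host token instead (the ts_debug output in the question shows the default parser labelling it host, and it appears to recognize a host even in protocol-less values like www.example.com/fish). A sketch, with the function name being my own:

create or replace function fn_get_url_host(in_link text) returns text as
$$
begin
  -- identical to the function above, but keep the token the parser labels 'host'
  return (select token
          from (select token, alias
                from ts_parse('default', in_link)
                       natural join ts_token_type('default')) tokens
          where tokens.alias = 'host');
end;
$$ language PLpgSQL;

That should slot straight into the aggregate query from the question; stripping a leading www. here is a naive guess that happens to fit the sample data:

select customer_id,
       array_agg(distinct regexp_replace(fn_get_url_host(url), '^www\.', '')) as unique_domains
from customer_url_table
group by customer_id;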

Upvotes: 1

khampson

Reputation: 15316

The document you linked above is for use with a Postgres text search parser. That requires a separate configuration to set up, and may be more overhead and/or a different sort of thing than you are looking for.

If you do want to go that route and set up a text search parser, you can find more info here:

http://www.postgresql.org/docs/9.3/static/sql-createtsconfig.html

However, if you want to do the parsing inline in Postgres, I would recommend using a procedural Postgres language, where you can import parsing libraries in that language.

You mentioned Python, so you could use PL/Python and a url parsing library such as urlparse (called urllib.parse in Python 3).

More info about urlparse

That includes this example code:

>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o   
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

Going beyond that example, you can get the hostname with the hostname member:

>>> print o.hostname
www.cwi.nl

If you want to properly parse out just the domain name (minus the www and any other assorted parts that may be there; there are lots of edge cases and variants), an approach such as in this answer would be best.

For more information about setting up PL/Python, you can go here:

http://www.postgresql.org/docs/9.3/static/plpython.html
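
Putting the pieces together, a PL/Python function wrapping urlparse might look roughly like this. This is only a sketch: the function name is mine, it assumes the plpythonu extension is installed, and the stripping of a leading www. is the naive approach (see the caveat above about edge cases):

-- hypothetical name; assumes CREATE EXTENSION plpythonu has already been run
CREATE OR REPLACE FUNCTION get_domain(url text) RETURNS text AS $$
    from urlparse import urlparse
    # urlparse only fills in the host when a scheme is present,
    # so prepend one for bare values like www.example.com/fish
    # (PL/Python parameters shouldn't be reassigned, hence the new name)
    full_url = url if '://' in url else 'http://' + url
    host = urlparse(full_url).hostname or ''
    # naive www-stripping; fine for the sample data, not for every edge case
    if host.startswith('www.'):
        host = host[4:]
    return host
$$ LANGUAGE plpythonu;

which would let you stay entirely in SQL for the query you already have:

SELECT customer_id, array_agg(DISTINCT get_domain(url)) AS unique_domains
FROM customer_url_table
GROUP BY customer_id;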

So, that's how you could do the parsing in Postgres

instead of dumping my results to Python and parsing it there

It ends up coming a bit full circle with PL/Python, but if you really want to do the parsing within SQL (especially for performance reasons, say, across a large data set), going with PL/Python may be worth the extra effort.

Upvotes: 4
