Reputation: 396
I'm having trouble parsing URLs in Postgres. I have a database full of customers and URLs associated with them. I need an array of the unique domains associated with each customer. I'd love to be able to do the parsing in my query instead of dumping my results to Python and parsing it there.
In the postgres docs I found this, but can't figure out how to incorporate it into my query:
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
alias | description | token
----------+---------------+------------------------------
protocol | Protocol head | http://
url | URL | example.com/stuff/index.html
host | Host | example.com
url_path | URL path | /stuff/index.html
(http://www.postgresql.org/docs/9.3/static/textsearch-parsers.html)
I'm starting with a table, like this:
customer_id | url
-------------+--------------------
000001 | www.example.com/fish
000001 | www.example.com/potato
000001 | www.potato.com/artichoke
000002 | www.otherexample.com
My code so far:
SELECT customer_id, array_agg(url)
FROM customer_url_table
GROUP BY customer_id
Which gives me:
customer_id | unique_domains
-----------------------------
000001 | {www.example.com/fish, www.example.com/potato, www.potato.com/artichoke}
000002 | {www.otherexample.com}
I want a table like this:
customer_id | unique_domains
-----------------------------
000001 | {example.com, potato.com}
000002 | {otherexample.com}
Working on a PostgreSQL 9.3.3 database that lives on AWS.
Upvotes: 4
Views: 10690
Reputation: 1301
You can create a function:
create or replace function fn_get_url_path(in_link text) returns text as
$$
begin
    return (select token
            from (select token, alias
                  from ts_parse('default', in_link)
                  natural join ts_token_type('default')) tokens
            where tokens.alias = 'url_path');
end;
$$ language plpgsql;
Then use the function:
select fn_get_url_path(column1)
from (values ('https://www.example.com'),
('https://www.example.com/test.html'),
('https://www.example.com/test?a=1'),
('this doesnt work and will return null')) a
To get the result:
/*
null
/test.html
/test?a=1
null
*/
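Note that fn_get_url_path pulls the url_path token, while the question asks for the domain. A sketch of the same approach adapted for that (untested here; fn_get_host is a hypothetical name, and the regexp_replace is a naive strip of a leading www. rather than proper public-suffix handling):

```sql
-- Same token-based approach, but keep the 'host' token instead
create or replace function fn_get_host(in_link text) returns text as
$$
begin
    return (select token
            from (select token, alias
                  from ts_parse('default', in_link)
                  natural join ts_token_type('default')) tokens
            where tokens.alias = 'host');
end;
$$ language plpgsql;

-- Aggregate unique domains per customer, stripping a leading www.
select customer_id,
       array_agg(distinct regexp_replace(fn_get_host(url), '^www\.', ''))
         as unique_domains
from customer_url_table
group by customer_id;
```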
Upvotes: 1
Reputation: 15316
The document you linked above is for use with a Postgres text search parser. That requires a separate configuration to set up, and may be more overhead and/or a different sort of thing than you are looking for.
If you do want to go that route, you can find more info on setting up a text search configuration here:
http://www.postgresql.org/docs/9.3/static/sql-createtsconfig.html
However, if you want to do the parsing inline in Postgres, I would recommend using a Postgres procedural language, where you can import parsing libraries for that language.
You mentioned Python, so you could use PL/Python and a URL parsing library such as urlparse (called urllib.parse in Python 3).
That includes this example code:
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
Going beyond that example, you can get the hostname with the hostname member:
>>> print o.hostname
www.cwi.nl
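Putting those pieces together outside the database for a moment, here is a minimal Python 3 sketch (urllib.parse; sample rows hard-coded from the question) of the hostname extraction and per-customer dedup that the PL/Python function body would perform:

```python
from urllib.parse import urlparse

def extract_domain(url):
    """Return the hostname minus a leading 'www.' (a naive strip --
    see the linked answer for proper public-suffix handling)."""
    # urlparse only detects a netloc after '//', so prepend it for
    # bare hosts like 'www.example.com/fish'
    if '//' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host

# Sample rows from the question's customer_url_table
rows = [
    ('000001', 'www.example.com/fish'),
    ('000001', 'www.example.com/potato'),
    ('000001', 'www.potato.com/artichoke'),
    ('000002', 'www.otherexample.com'),
]

# Collect the unique domains per customer
unique_domains = {}
for customer_id, url in rows:
    unique_domains.setdefault(customer_id, set()).add(extract_domain(url))

for customer_id in sorted(unique_domains):
    print(customer_id, sorted(unique_domains[customer_id]))
# 000001 ['example.com', 'potato.com']
# 000002 ['otherexample.com']
```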
If you want to properly parse out just the domain name (minus the www and any other assorted parts that may be there), there are lots of edge cases and variants, so an approach such as the one in this answer would be best.
For more information about setting up PL/Python, you can go here:
http://www.postgresql.org/docs/9.3/static/plpython.html
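Wrapped up as a database function, it could look something like this (a hypothetical sketch, not tested on 9.3; py_get_hostname is my own name, and it assumes the plpythonu extension is installed):

```sql
-- Requires: CREATE EXTENSION plpythonu;
create or replace function py_get_hostname(url text) returns text as
$$
    from urlparse import urlparse   # urllib.parse under PL/Python3
    # urlparse only detects a netloc after '//', so prepend it for
    # bare hosts like 'www.example.com/fish'
    if '//' not in url:
        url = '//' + url
    return urlparse(url).hostname
$$ language plpythonu;

-- Then the original query becomes:
select customer_id,
       array_agg(distinct py_get_hostname(url)) as unique_domains
from customer_url_table
group by customer_id;
```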
So, that's how you could do the parsing in Postgres instead of "dumping my results to Python and parsing it there".
It ends up coming a bit full circle with PL/Python, but if you really want to do the parsing within SQL (say, for performance across a large data set), PL/Python may be worth the extra effort.
Upvotes: 4