Extracting values from non-standard markup strings in PostgreSQL

Question

Unfortunately, I have a table like the following:

DROP TABLE IF EXISTS my_list;
CREATE TABLE my_list (index int PRIMARY KEY, mystring text, status text);

INSERT INTO my_list    
(index, mystring,                                           status) VALUES 
   (12, '',                                                    'D'), 
   (14, '[id] 5',                                              'A'), 
   (15, '[id] 12[num] 03952145815',                            'C'), 
   (16, '[id] 314[num] 03952145815[name] Sweet',               'E'), 
   (19, '[id] 01211[num] 03952145815[name] Home[oth] Alabama', 'B');

Is there any trick to extract the number that follows [id] as an integer from the mystring text shown above? That is, as though I ran the following query:

SELECT index, extract_id_function(mystring), status FROM my_list;

and got results like:

12  0     D  
14  5     A 
15  12    C 
16  314   E 
19  1211  B

Preferably using only simple string functions; if that isn't possible, a regular expression will be fine.

Craig Ringer · Accepted Answer

If I understand correctly, you have a rather unconventional markup format where [id] is followed by a space, then a series of digits that represents a numeric identifier. There is no closing tag; the first non-digit character after the number ends the ID.

If so, you can do this with non-regexp string ops, but only quite badly. What you really need is the SQL equivalent of strtol, which consumes input up to the first non-digit and returns the number parsed so far. A cast to integer will not do that; it reports an error if it sees non-numeric garbage after the number. (As it happens I just wrote a C extension that exposes strtol for decoding hex values, but I'm guessing you don't want to use C extensions if you don't even want regex...)
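To make the strtol idea concrete, here is a minimal sketch (not from the original answer) that approximates "consume leading digits, ignore the rest" with plain string functions. It relies on ltrim with a character set to strip the leading digits, then uses the length difference to cut them out. The sample values and the column alias leading_int are made up for illustration.

```sql
-- Sketch: strtol-like parsing of leading digits without regular expressions.
-- ltrim(s, '0123456789') removes the leading run of digits, so the length
-- difference is the number of leading digit characters.
SELECT s AS input,
       nullif(
         substring(s from 1 for length(s) - length(ltrim(s, '0123456789'))),
         ''            -- no leading digits: turn '' into NULL before casting
       )::integer AS leading_int
FROM (VALUES ('12[num] 03952145815'),
             ('01211[name] Home'),
             ('abc')) AS t(s);
```

The first two rows should yield 12 and 1211, and the last row NULL, whereas a bare cast such as '12[num] 03952145815'::integer raises an error.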

It can be done with string ops if you make the simplifying assumption that an [id] nnnn tag always ends with either end of string or another tag, so it's always [ at the end of the number. We also assume that you're only interested in the first [id] if multiple appear in a string. That way you can write something like the following horrible monstrosity:

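select
  "index",
  -- cast the digits to integer; rows without an [id] tag become 0
  coalesce(
    (case
       when next_tag_idx > 0 then substring(cut_id from 1 for next_tag_idx - 1)
       else cut_id
     end)::integer,
    0
  ) AS "my_id",
  "status"
from (
  select
    -- position of the next tag's '[' within the remainder, 0 if none
    position('[' in cut_id) AS next_tag_idx,
    *
  from (
    select
      case
        when id_offset = 0 then null
        -- '[id] ' is 5 characters long, so the digits start at id_offset + 5
        else substring(mystring from id_offset + 5)
      end AS cut_id,
      *
    from (
      select
        -- 1-based position of the first '[id] ', 0 if absent
        position('[id] ' in mystring) AS id_offset,
        *
      from my_list
    ) x
  ) y
) z;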

(If anybody ever actually uses that query for anything, kittens will fall from the sky and splat upon the pavement, wailing in horror all the way down).
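For what it's worth, the same simplifying assumptions (the number is terminated by the next [ or by end of string, and only the first [id] matters) can be expressed more compactly. The following is a sketch rather than part of the original answer; it assumes the my_list table from the question and uses the built-in split_part function.

```sql
-- Sketch: same assumptions as the nested query, written with split_part.
-- Inner split_part: everything after the first '[id] ' ('' if the tag is absent).
-- Outer split_part: everything before the next '[' (the digits).
SELECT "index",
       coalesce(
         nullif(split_part(split_part(mystring, '[id] ', 2), '[', 1), '')::integer,
         0             -- no [id] tag: report 0, as in the question
       ) AS my_id,
       status
FROM my_list;
```

It still breaks if the value after [id] is not purely numeric, since the cast to integer will then raise an error.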

Or you can be sensible and just use a regular expression for this kind of string processing, in which case your query (assuming you only want the first [id]) is:

regress=> SELECT
            "index", 
            coalesce((SELECT (regexp_matches(mystring, '\[id\]\s?(\d+)'))[1])::integer, 0) AS my_id,
            status 
          FROM my_list;
 index | my_id | status 
-------+-------+--------
    12 |     0 | D
    14 |     5 | A
    15 |    12 | C
    16 |   314 | E
    19 |  1211 | B
(5 rows)
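The scalar subquery in the query above is there because regexp_matches is a set-returning function: it produces no row at all when nothing matches, and the subquery turns that into NULL so coalesce can supply 0. As a possible simplification (a sketch, not the original answer's query), the regex form of substring returns NULL directly when there is no match and returns the first parenthesized group when there is one:

```sql
-- Sketch: substring(text from pattern) returns the first captured group,
-- or NULL when the pattern does not match, so no subquery is needed.
SELECT "index",
       coalesce(substring(mystring from '\[id\]\s?(\d+)')::integer, 0) AS my_id,
       status
FROM my_list;
```

This assumes standard_conforming_strings is on (the default in current PostgreSQL releases), so the backslashes in the pattern are taken literally.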

Update: If you're having issues with Unicode handling in regular expressions, upgrade to PostgreSQL 9.2 or later. See https://stackoverflow.com/a/14293924/398670
