luckydonald
luckydonald

Reputation: 6846

Get each <tag> in String - stackexchange database

Mockup code for my problem:

SELECT Id FROM Tags WHERE TagName IN '<osx><keyboard><security><screen-lock>'

The problem in detail

I am trying to get tags used in 2011 from apple.stackexchange data. (this query)

As you can see, tags in tag changes are stored as plain text in the Text field. example output with stackexchange tags

<tag1><tag2><tag3>
<osx><keyboard><security><screen-lock>

How can I create a unique list of the tags, to look them up in the Tags table, instead of this hardcoded version:

SELECT * FROM Tags
  WHERE TagName = 'osx' 
     OR TagName = 'keyboard' 
     OR TagName = 'security'

Here is a interactive example.

Stackexchange uses T-SQL, my local copy is running under postgresql using Postgres app version 9.4.5.0.

Upvotes: 1

Views: 173

Answers (2)

Kamil Gosciminski
Kamil Gosciminski

Reputation: 17147

I've simplified the data to the relevant column only and called it tags to present the example.

Sample data

create table posthistory(tags text);
insert into posthistory values
  ('<lion><backup><time-machine>'),
  ('<spotlight><alfred><photo-booth>'),
  ('<lion><pdf><preview>'),
  ('<pdf>'),
  ('<asd>');

Query to get unique list of tags

SELECT DISTINCT
  unnest(
    regexp_split_to_array(
      trim('><' from tags), '><'
    )
  )
FROM
  posthistory

First we're removing all occurences of leading and trailing > and < signs from each row, then using regexp_split_to_array() function to get values into arrays, and then unnest() to expand an array to a set of rows. Finally DISTINCT eliminates duplicate values.

Presenting SQLFiddle to preview how it works.

Upvotes: 1

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656401

Assuming this table definition:

CREATE TABLE posthistory(post_id int PRIMARY KEY, tags text);

Depending on what you want exactly:

To convert the string to an array, trim leading and trailing '<>', then treat '><' as separator:

SELECT *, string_to_array(trim(tags, '><'), '><') AS tag_arr
FROM   posthistory;

To get list of unique tags for whole table (I guess you want this):

SELECT DISTINCT tag
FROM   posthistory, unnest(string_to_array(trim(tags, '><'), '><')) tag;

The implicit LATERAL join requires Postgres 9.3 or later.

This should be substantially faster than using regular expressions. If you want to try regexp, use regexp_split_to_table() instead of regexp_split_to_array() followed by unnest() like suggested in another answer:

SELECT DISTINCT tag
FROM   posthistory, regexp_split_to_table(trim(tags, '><'), '><') tag;

Also with implicit LATERAL join. Related:

To search for particular tags:

SELECT *
FROM   posthistory
WHERE  tags LIKE '%<security>%'
AND    tags LIKE '%<osx>%';

SQL Fiddle.

Applied to your search in T-SQL in our data explorer:

SELECT TOP 100
       PostId, UserId, Text AS Tags FROM PostHistory
WHERE  year(CreationDate) = 2011
AND    PostHistoryTypeId IN (3  -- initial tags
                           , 6  -- edit tags
                           , 9) -- rollback tags
AND    Text LIKE ('%<' + ##TagName:String?postgresql## + '>%');

(T-SQL syntax uses the non-standard + instead of ||.)
https://data.stackexchange.com/apple/query/edit/417055

Upvotes: 1

Related Questions