Alice Everett
Alice Everett

Reputation: 375

SQL statement to create a dictionary based on an unorganized table

I have a data-heavy table (about 1TB in size) based on the following schema:

CREATE TABLE my_table(
  col1  character varying, 
  col2 character varying, 
  col3 character varying
);

The aforementioned table contains data of the following form:

col1     col2    col3
<abc1>   <def1>  <ghi1>
<abc1>   <g1>    <g2>
<g3>     <g1>    <g4>

Now, what I'd like to do is create a dictionary (another table) which maps all of the unique strings of table my_table (present in col1, col2 and col3) to an integer value. e.g.:

<abc1>    1
<def1>    2
<ghi1>    3
<g1>      4
<g2>      5
<g3>      6
<g4>      7

I know I can do this using my programming language of choice (Python, for example). What I'd like to know is whether it possible to achieve something like this using pure SQL?

Upvotes: 0

Views: 227

Answers (3)

wildplasser
wildplasser

Reputation: 44220

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE bigstrings
        ( col1 varchar
        , col2 varchar
        , col3 varchar
        );

INSERT INTO bigstrings(col1, col2, col3) VALUES
 ('abc1','def1','ghi1')
,('abc1','g1','g2')
,('g3','g1','g4')
        ;

CREATE TABLE allstrings
        ( num BIGSERIAL NOT NULL PRIMARY KEY
        , string varchar NOT NULL UNIQUE
        );

CREATE TABLE nostrings
        ( col1 BIGINT REFERENCES allstrings(num)
        , col2 BIGINT REFERENCES allstrings(num)
        , col3 BIGINT REFERENCES allstrings(num)
        );

INSERT INTO allstrings( string)
SELECT DISTINCT col1 FROM bigstrings bs
 -- not needed on empty allstrings table.
 -- WHERE NOT EXISTS ( SELECT * FROM allstrings nx WHERE nx.string = bs.col1)
        ;

INSERT INTO allstrings( string)
SELECT DISTINCT col2 FROM bigstrings bs
WHERE NOT EXISTS ( SELECT * FROM allstrings nx WHERE nx.string = bs.col2)
        ;

INSERT INTO allstrings( string)
SELECT DISTINCT col3 FROM bigstrings bs
WHERE NOT EXISTS ( SELECT * FROM allstrings nx WHERE nx.string = bs.col3)
        ;

INSERT INTO nostrings(col1,col2,col3)
SELECT s1.num, s2.num, s3.num
FROM bigstrings bs
JOIN allstrings s1 ON s1.string = bs.col1
JOIN allstrings s2 ON s2.string = bs.col2
JOIN allstrings s3 ON s3.string = bs.col3
        ;

SELECT * FROM nostrings;

RESULT:

 col1 | col2 | col3 
------+------+------
    2 |    3 |    6
    2 |    4 |    7
    1 |    4 |    5
(3 rows)

Upvotes: 1

misterManager
misterManager

Reputation: 1174

Ok, I'm coming from a SQL Server perspective on this, but the concepts will largely be the same. It is my understanding that SERIAL is equivalent to IDENTITY in SQL Server such that it will provide you with an auto-incrementing key. My solution would likely be as follows:

CREATE TABLE DistinctStrings
(
   Id SERIAL NOT NULL,
   String CHARACTER VARYING NOT NULL
)

I assume you want to keep that table around, so it would obviously be part of the database and not recreated every time you fill it.

Insert to this table like this

INSERT INTO DistinctStrings (String)
SELECT col1 FROM my_table UNION
SELECT col2 FROM my_table UNION
SELECT col3 FROM my_table 

The use of the UNION (as opposed to UNION ALL) will give you the distinctness you are looking for. The use of SERIAL will give you your IDs.

Upvotes: 1

jdotjdot
jdotjdot

Reputation: 17042

Honestly, you'd have an easier time doing this in SQL.

For example, this does the trick:

SELECT DISTINCT ROW_NUMBER() OVER (ORDER BY col ASC) AS row, col FROM 

(SELECT col1 AS col FROM mytable
UNION
SELECT col2 AS col FROM mytable
UNION
SELECT col3 AS col FROM mytable
) AS newtable

ORDER BY row;

See the SQLFiddle of this.

I'm not 100% confident that using the UNIONs is the most efficient way of doing this, but I do know that it meets your criteria of retrieving distinct strings from across all three columns and assigning them all numbers. Doing this in Python would be far slower.

Upvotes: 2

Related Questions