Ryder Brooks
Ryder Brooks

Reputation: 2119

postgresql find common patterns of minimum length between strings

I'm using postgresql 9.3 and I'm trying to inner join two tables on common string patterns of a minimum length.

Also I'm a noob to SQL try to be patient if possible.

For example:

TABLE 1

ID        DATA
1         '1234,5678,1234,1111'
2         '1111,2222'
3         '4321'



TABLE 2

IDa        DATA
1a         '1111,2222,1234,5678,4321'
2a         '1111,3837,2222'
3a         '4321'

joining DATA column on strings matching more than 9 chars would yield:

IDa    ID     DATA
1a     2     '1111,2222'
1a     1     '1234,5678'

I had some success using LIKE but I can't force a minimum match length condition(or at least I don't know how). I'm assuming a regex is the solution here but I haven't been able to write one that accomplished what I'm looking for.

Upvotes: 0

Views: 420

Answers (1)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656714

Your examples match on 2 x 4 characters, not more than 9 chars.

I suggest using array types (int[]) instead of character types, in combination with the handy intersection operator & from the additional module intarray. More details:
- Error when creating unaccent extension on PostgreSQL
- Postgresql intarray error: undefined symbol: pfree

Query could look like this

SELECT t2.ida, t1.id, t1.data & t2.data AS intersecting_data
FROM   tbl1 t1
JOIN   tbl2 t2 ON array_length(t1.data & t2.data, 1) = 2; -- or "> 1" ?

Not very efficient, this kind of cross join does not scale well.

Normalize

Faster alternative: a normalized schema with 1 row per data item. Then the operation boils down to a .

tbl1_data

tbl1_id  item
1        1234
1        5678
1        1234
1        1111
2        1111
...

tbl2_data

tbl1_id  item
1a       1111
1a       2222
...

Then the query could be:

SELECT tbl1_id, tbl2_id, array_agg(item) AS data
FROM   tbl1_data d1
JOIN   tbl2_data d2 USING (item)
GROUP  BY 1,2
HAVING count(*) = 2;  --  or "> 1" ?

Upvotes: 1

Related Questions