oldhomemovie
oldhomemovie

Reputation: 15129

Find non-matching characters from string1 and string2

I have two long strings (5684400 characters each). They are almost the same: only a few characters are different.

I need to find what those characters are.

What is the fastest way to do so in PostgreSQL?

Upvotes: 0

Views: 148

Answers (2)

klin
klin

Reputation: 121604

That is not a task for a database server. However, if you do not want to transmit enormous strings from a remote server, install one of the available procedural languages, for example Python:

create or replace function diff_str(str1 text, str2 text)
returns setof text language plpython3u as $$
    res = []
    for i, c in enumerate(str1):
        if c != str2[i]:
            res += ('{}: {}->{}'.format(i+1, str1[i], str2[i]),)
    return res;
$$;

select * from diff_str('abcdefghijk', 'abcXefgYijk');

 diff_str 
----------
 4: d->X
 8: h->Y
(2 rows)

or JavaScript (plv8):

create or replace function diff_str_v8(str1 text, str2 text)
returns setof text language plv8 as $$
    for (var i = 0; i < str1.length; i++)
        if (str1[i] != str2[i])
            plv8.return_next(i+1+ ': '+ str1[i]+ '->'+ str2[i]);
$$;

The functions were tested on 12-million-character strings. Plv8 needed circa 0.2s, Python about 1.5s.

Upvotes: 2

user330315
user330315

Reputation:

A brute-force method is replace all matching cto turn the strings into a set and then use a full outer join and find only those that are different.

E.g. to compare 'Hello, world' and 'Hello world.' you could use

with s1(c) as (
  select * 
  from unnest(regexp_split_to_array('Hello, world', ''))
), s2 (c) as (
  select * 
  from unnest(regexp_split_to_array('Hello world.', ''))
)
select coalesce(s1.c, s2.c) as different
from s1
  full outer join s2 on s1.c = s2.c
where s1 is distinct from s2;

The above query returns:

different
---------
,        
.        

If you need this for a one-time thing, this is probably good enough. But this is not going to scale well for a lot of strings.

Upvotes: 1

Related Questions