Reputation: 15129
I have two long strings (5684400 characters each). They are almost the same: only a few characters are different.
I need to find what those characters are.
What is the fastest way to do so in PostgreSQL?
Upvotes: 0
Views: 148
Reputation: 121604
That is not a task for a database server. However, if you do not want to transmit enormous strings from a remote server, install one of the available procedural languages, for example Python:
create or replace function diff_str(str1 text, str2 text)
returns setof text language plpython3u as $$
res = []
for i, c in enumerate(str1):
if c != str2[i]:
res += ('{}: {}->{}'.format(i+1, str1[i], str2[i]),)
return res;
$$;
select * from diff_str('abcdefghijk', 'abcXefgYijk');
diff_str
----------
4: d->X
8: h->Y
(2 rows)
create or replace function diff_str_v8(str1 text, str2 text)
returns setof text language plv8 as $$
for (var i = 0; i < str1.length; i++)
if (str1[i] != str2[i])
plv8.return_next(i+1+ ': '+ str1[i]+ '->'+ str2[i]);
$$;
The functions were tested on 12-million-character strings. Plv8 needed circa 0.2s, Python about 1.5s.
Upvotes: 2
Reputation:
A brute-force method is replace all matching cto turn the strings into a set and then use a full outer join and find only those that are different.
E.g. to compare 'Hello, world'
and 'Hello world.'
you could use
with s1(c) as (
select *
from unnest(regexp_split_to_array('Hello, world', ''))
), s2 (c) as (
select *
from unnest(regexp_split_to_array('Hello world.', ''))
)
select coalesce(s1.c, s2.c) as different
from s1
full outer join s2 on s1.c = s2.c
where s1 is distinct from s2;
The above query returns:
different
---------
,
.
If you need this for a one-time thing, this is probably good enough. But this is not going to scale well for a lot of strings.
Upvotes: 1