Reputation: 13529
I have two lists of names for locations, with slightly different spelling, capitalisation, etc.
I'm trying to match each site in the first list to the most similar one in the second list.
SELECT name1, name2 FROM table1, table2
WHERE levenshtein(name1, name2) < 3
Is along the right lines, but throws up a few false positives. Most of those would be avoided if I could select the record with the smallest Levenshtein distance.
In Python I would be looking to do:
for row1 in table1:
min_dist = 100
for row2 in table2:
dist = levenshtein(row1.name, row2.name)
if dist < min_dist:
min_dist = dist
closest_name = row2.name
row1.newcolumn = closest_name
Is this possible in postgresql
?
Upvotes: 0
Views: 692
Reputation: 23850
If I understand your question correctly, for each name1
, you are looking for the closest name2
. That can be accomplished with a subquery, something like that:
SELECT
name1,
(SELECT name2
FROM table2
WHERE levenshtein(name1, name2) < 3
ORDER BY levenshtein(name1, name2)
LIMIT 1
)
FROM table1
I am assuming that you have defined levenshtein
yourself.
Upvotes: 1