Reputation: 393
I asked a similar question a couple months back. Located here: MySQL Query based on string
The problem I am running into is that this works in only one particular order and in some cases it works too well.
Here is a snippet of the data that this query is filtering out (duplicates are intended, actual data):
- BELLMORE
- ATLANTIC BCH
- ATLANTIC BEACH
- E HILLS
- EAST HILLS
- EAST ROCKAWAY
- FAR ROCKAWAY
- FLORAL PARK
- FLORAL PARK
- HIGHLAND HEIGHTS
- N HIGHLAND HGTS
- NORTH HIGHLAND HEIGHTS
The one query that helped in my last question (MySQL Query based on string) worked well for one instance and failed for the other instance. Here is the query:
select names from tablename group by substring_index(names," ",1)
Which returns:
- BELLMORE
- ATLANTIC BEACH
- EAST HILLS
- FAR ROCKAWAY
- FLORAL PARK
- HIGHLAND HEIGHTS
- N HIGHLAND HGTS
- NORTH HIGHLAND HEIGHTS
The problem with that one is that as you can see it removed a city that it should not have because it was only using the first word to group it by. The one it removed was:
- EAST ROCKAWAY
It was GROUP'ed BY EAST.
As I continue to write this I feel it is nearly impossible because the position of the static city name vs. the variable parts is always changing. Unless you can compare a certain amount of characters. Which is not flawless by far. If someone thinks they may have some insight or have worked with and accomplished such a thing I will appreciate the feedback and guidance. The ending result would be:
- BELLMORE
- ATLANTIC BEACH
- EAST HILLS
- EAST ROCKAWAY
- FAR ROCKAWAY
- FLORAL PARK
- HIGHLAND HEIGHTS
Upvotes: 6
Views: 867
Reputation: 6573
Toughie...
I'd certainly make use of Michael's suggestion and throw into that the possibility of maintaining unique place names in the database.
This would allow you to only use the string distance calculation on adding a new places. You could then manage places by assigning an associate_id to those places that the levenshtein identifies.
Perhaps you could use some other data (like geolocation) to further tune how you associate places. The up shot is perhaps just using the place name may not be the best solution to your problem...
Upvotes: 1
Reputation: 37378
My suggestion will be an expensive query, but hopefully you could use this type of operation to perform an occasional "cleaning" of your data so that it won't be required every time you query this data.
You might consider looking in a Levenshtein distance formula... which is a string metric for measuring the amount of difference between two sequences.
In order to avoid needing to calculate the distance for a cartesian product of your table, you could first narrow the set of cities and addresses to be compared with a quicker sanity check... such as they begin with the same letter, and have a similar length.
Initially, You could then start off by only returning records with a very small Levenshtein distance... You could then select one variation of the matches returned to apply to the other records in order to normalize your data.
You could then gradually increase the distance until you start to get too many false positives.
Here's an implementation directly in MySql:
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
Upvotes: 2