Reputation: 355
I have a scheduling database, and I would like to find a way to select all future appointments that are similar because we have a lot of clients who double book themselves. I have been looking but I cant find a way to do quite what I want.
For example if I have the following rows I would like to be able to select those with the same language, a time within 15 minutes of each other, and with names that share > 70% of the same characters.
|Rec_id|Date_time |Language|App_name |
|1 |2014-03-15 12:40|Spanish |Ricardo |
|2 |2014-03-15 12:45|Spanish |Ricerdu |
|3 |2014-03-16 12:45|Tongan |Tuaffu |
|4 |2014-03-17 12:45|Korean |Kim |
|5 |2014-03-18 12:45|German |Biternof |
|6 |2014-03-18 12:32|German |Biterknof|
From the above data the records I would need are 1,2,5, and 6.
Upvotes: 1
Views: 300
Reputation: 901
That's a nice question.. Kept me awake for a long time at Night.. :D :D First of all, you don't need any triggers and/or extra rows for solving this.
This is what I have attempted so far.
SELECT
t3.*
FROM
Table1 t1
INNER JOIN
Table1 t2
ON
(
t2.Rec_id > t1.Rec_id
AND t2.Language = t1.Language
AND ABS( TIMESTAMPDIFF(MINUTE, t1.Date_time, t2.Date_time)) <= 15
#AND (
# SOUNDEX( t1.App_name ) LIKE CONCAT(TRIM(TRAILING '0' FROM SOUNDEX( t2.App_name )), '%')
# OR
# SOUNDEX( t2.App_name ) LIKE CONCAT(TRIM(TRAILING '0' FROM SOUNDEX( t1.App_name )), '%')
#)
)
INNER JOIN
Table1 t3
ON( t1.Rec_id = t3.Rec_id OR t2.Rec_id = t3.Rec_id )
GROUP BY t3.Rec_id
I have commented out the SOUNDEX
part.. 70% match is something fuzzy, isn't it..
Try uncommenting the SOUNDEX
part of the query to check if it solves the larger problem.
Upvotes: 0
Reputation: 733
First thing I thought of was Levenshtein but since MySQL has no native support for it - things get little more complex.
This solution isn't optimized or the best solution at all, but I should do the work.
Trigger + query itself:
DELIMITER //
CREATE TRIGGER `booking_before_insert` BEFORE INSERT ON `booking` FOR EACH ROW BEGIN
DECLARE existingId INT(10) DEFAULT NULL;
SELECT
MAX(id)
INTO
existingId
FROM
booking
WHERE
booking.dirty_id IS NULL AND
booking.lang = NEW.lang AND
booking.created >= DATE_SUB(NOW(), INTERVAL 15 MINUTE) AND
(LEVENSHTEIN(booking.name, NEW.name) / LENGTH(booking.name)) < 0.3;
SET NEW.dirty_id = existingId;
END//
DELIMITER ;
You can read more about Levenshtein from:
Now you can detect the duplicates with Duplicate_for.
Upvotes: 1