How can I find similar records in the same table?

Question

I have a scheduling database, and I would like to find a way to select all future appointments that are similar because we have a lot of clients who double book themselves. I have been looking but I cant find a way to do quite what I want.

For example if I have the following rows I would like to be able to select those with the same language, a time within 15 minutes of each other, and with names that share > 70% of the same characters.

   |Rec_id|Date_time       |Language|App_name |
   |1     |2014-03-15 12:40|Spanish |Ricardo  |
   |2     |2014-03-15 12:45|Spanish |Ricerdu  |
   |3     |2014-03-16 12:45|Tongan  |Tuaffu   |
   |4     |2014-03-17 12:45|Korean  |Kim      |
   |5     |2014-03-18 12:45|German  |Biternof |
   |6     |2014-03-18 12:32|German  |Biterknof|

From the above data the records I would need are 1,2,5, and 6.

Niko Hujanen · Accepted Answer

First thing I thought of was Levenshtein but since MySQL has no native support for it - things get little more complex.

This solution isn't optimized or the best solution at all, but I should do the work.

I would create a new column, let's call it "Duplicate_for", DEFAULT NULL.
I would create a trigger: TRIGGER BEFORE INSERT for this table.
I would create a function to calculate the Levenshtein distance for two strings.
I would combine TRIGGER and Levenshtein with following query.

Trigger + query itself:

DELIMITER //
CREATE TRIGGER `booking_before_insert` BEFORE INSERT ON `booking` FOR EACH ROW BEGIN
    DECLARE existingId INT(10) DEFAULT NULL;

    SELECT 
        MAX(id) 
    INTO 
        existingId
    FROM 
        booking 
    WHERE   
        booking.dirty_id IS NULL AND
        booking.lang = NEW.lang AND
        booking.created >= DATE_SUB(NOW(), INTERVAL 15 MINUTE) AND 
        (LEVENSHTEIN(booking.name, NEW.name) / LENGTH(booking.name)) < 0.3;

    SET NEW.dirty_id = existingId;
END//
DELIMITER ;

You can read more about Levenshtein from:

Now you can detect the duplicates with Duplicate_for.

How can I find similar records in the same table?

Answers (2)

Related Questions