June Lewis
June Lewis

Reputation: 355

How can I find similar records in the same table?

I have a scheduling database, and I would like to find a way to select all future appointments that are similar because we have a lot of clients who double book themselves. I have been looking but I cant find a way to do quite what I want.

For example if I have the following rows I would like to be able to select those with the same language, a time within 15 minutes of each other, and with names that share > 70% of the same characters.

   |Rec_id|Date_time       |Language|App_name |
   |1     |2014-03-15 12:40|Spanish |Ricardo  |
   |2     |2014-03-15 12:45|Spanish |Ricerdu  |
   |3     |2014-03-16 12:45|Tongan  |Tuaffu   |
   |4     |2014-03-17 12:45|Korean  |Kim      |
   |5     |2014-03-18 12:45|German  |Biternof |
   |6     |2014-03-18 12:32|German  |Biterknof|

From the above data the records I would need are 1,2,5, and 6.

Upvotes: 1

Views: 300

Answers (2)

Manu
Manu

Reputation: 901

That's a nice question.. Kept me awake for a long time at Night.. :D :D First of all, you don't need any triggers and/or extra rows for solving this.

This is what I have attempted so far.

SELECT
t3.*
FROM
Table1 t1
INNER JOIN
Table1 t2
ON
(
  t2.Rec_id > t1.Rec_id
  AND t2.Language = t1.Language
  AND ABS( TIMESTAMPDIFF(MINUTE, t1.Date_time, t2.Date_time)) <= 15
  #AND (
  #  SOUNDEX( t1.App_name ) LIKE CONCAT(TRIM(TRAILING '0' FROM SOUNDEX( t2.App_name )), '%') 
  #  OR
  #  SOUNDEX( t2.App_name ) LIKE CONCAT(TRIM(TRAILING '0' FROM SOUNDEX( t1.App_name )), '%')
  #)
)
INNER JOIN
Table1 t3
ON( t1.Rec_id = t3.Rec_id OR t2.Rec_id = t3.Rec_id )
GROUP BY t3.Rec_id

I have commented out the SOUNDEX part.. 70% match is something fuzzy, isn't it.. Try uncommenting the SOUNDEX part of the query to check if it solves the larger problem.

Upvotes: 0

Niko Hujanen
Niko Hujanen

Reputation: 733

First thing I thought of was Levenshtein but since MySQL has no native support for it - things get little more complex.

This solution isn't optimized or the best solution at all, but I should do the work.

  • I would create a new column, let's call it "Duplicate_for", DEFAULT NULL.
  • I would create a trigger: TRIGGER BEFORE INSERT for this table.
  • I would create a function to calculate the Levenshtein distance for two strings.
  • I would combine TRIGGER and Levenshtein with following query.

Trigger + query itself:

DELIMITER //
CREATE TRIGGER `booking_before_insert` BEFORE INSERT ON `booking` FOR EACH ROW BEGIN
    DECLARE existingId INT(10) DEFAULT NULL;

    SELECT 
        MAX(id) 
    INTO 
        existingId
    FROM 
        booking 
    WHERE   
        booking.dirty_id IS NULL AND
        booking.lang = NEW.lang AND
        booking.created >= DATE_SUB(NOW(), INTERVAL 15 MINUTE) AND 
        (LEVENSHTEIN(booking.name, NEW.name) / LENGTH(booking.name)) < 0.3;

    SET NEW.dirty_id = existingId;
END//
DELIMITER ;

You can read more about Levenshtein from:

Now you can detect the duplicates with Duplicate_for.

Upvotes: 1

Related Questions