user3382203
user3382203

Reputation: 179

MYSQL PHP: Find duplicates based on Address Column

I have an addresses table in my MYSQL database with the following structure:

enter image description here

I wish to display all the records and highlight the possible duplicates, based on the address/address slug.

In this case, the duplicates are as follows:

Is there a way to partially match a string in MYSQL or PHP, to achieve the above results?

FYI: I have gone through SPHINX PHP, SQL FULLTEXT SEARCHES etc.

I have been struggling over 2 weeks, but couldn't find any optimal solution.

Any ideas, suggestions, solutions are welcome.

Upvotes: 2

Views: 415

Answers (2)

FanoFN
FanoFN

Reputation: 7114

  1. Make an empty duplicate of the table - e.g. mytable_to_update.
  2. Run a few queries to find out duplicates.
  • Start with populating the newly created table with non-duplicates. Initial query:
SELECT SUBSTRING_INDEX(Name,' ',1),COUNT(*) 
FROM mytable_to_update 
GROUP BY SUBSTRING_INDEX(Name,' ',1) HAVING COUNT(*) = 1;

The SUBSTRING_INDEX will capture the first string before space (' '). In the example, Sam Mcarthy will become Sam only. Then using that to group and count how many name occurrences it has. HAVING COUNT(*) = 1 will only show any name occurring once. But that might as well return nothing if there's a name like Joe and Joe John but the two are actually a different person with different addresses (since the first query only group by the first name occurring). Therefore, we need to add address comparison in the mix.

  • Add the same function to the Address column like this:
SELECT SUBSTRING_INDEX(Name,' ',1), 
       SUBSTRING_INDEX(Address,' ',1), /*we take the first string in the address*/
       COUNT(*)
FROM mytable_to_update 
GROUP BY SUBSTRING_INDEX(Name,' ',1), 
      SUBSTRING_INDEX(Address,' ',1) /*then add group by for the address*/
HAVING COUNT(*) = 1;

Similarly, we take only the first string occurrence from the address. So let's say for example there are two data that looks like this, Joe, 12 Street.. and Joe John, 12 St. .., what will happen is the query above will (given the SUBSTRING_INDEX function) take only the first string occurrence; Joe, 12 , which will return the count value as 2. That means both data (Joe, 12 Street.. and Joe John, 12 St. ..) are considered as duplicates and will not show in the query results.

  • Change the query to list out all non-duplicates ID to be inserted into mytable_to_update table:
INSERT INTO mytable_to_update 
SELECT * FROM mytable WHERE ID IN
(SELECT GROUP_CONCAT(ID) /*replace everything else in the select with just `ID`*/
FROM mytable
GROUP BY SUBSTRING_INDEX(Name,' ',1), 
      SUBSTRING_INDEX(Address,' ',1)
HAVING COUNT(*) = 1) ;

Note: I'm using GROUP_CONCAT(ID) because of incompatibility of sql_mode=only_full_group_by - if it's being set. Of course the result could be different (like '1,2' or '1,,,,,') but since we're only looking at any count=1, it shouldn't have a problem as it will only return 1 value. I've tested with ANY_VALUE it also return similar results.

Now you have all the non-duplicates inside the mytable_to_update table. the next step is to search for duplicates and insert the ones that you only want. This is merely a suggestion/assumption of what you might want and it's not 100% accurate due to the nature of the data value that we're comparing.

  • The query is similarly structured and changed only in a few places, for example:
SELECT GROUP_CONCAT(ID), /*add GROUP_CONCAT to list all the duplicates group by the first name & address string.*/
       Name, 
       Address, 
       COUNT(*) 
FROM mytable
GROUP BY SUBSTRING_INDEX(Name,' ',1), 
         SUBSTRING_INDEX(Address,' ',1) 
HAVING COUNT(*) > 1; /*Change '= 1' to '> 1' to get any records with more than 1 count.*/

Using GROUP_CONCAT to generate a comma separated list of ID that has possible duplicates.

  • Then add GROUP_CONCAT over all the columns listed with identical ORDER BY so every columns will be ordering by the same thing.
SELECT GROUP_CONCAT(ID ORDER BY ID), /*add ORDER BY*/
       GROUP_CONCAT(Name ORDER BY ID), 
       GROUP_CONCAT(Address ORDER BY ID), 
       COUNT(*) 
FROM mytable
GROUP BY SUBSTRING_INDEX(Name,' ',1), 
         SUBSTRING_INDEX(Address,' ',1) 
HAVING COUNT(*) > 1;

With this you go over the values it returned for any of the duplicates and compare it side by side. That way you can decide to omit any ID that you don't want to appear in the list by adding WHERE ID NOT IN(1,3 ...) etc.

  • Once you've finalized which ID you want to keep, you can do something like this:
INSERT INTO mytable_to_update 
SELECT * FROM mytable WHERE ID IN
(SELECT SUBSTRING_INDEX(GROUP_CONCAT(ID ORDER BY ID),',',1) 
     /*assuming that you only want the first ID in the set, do SUBSTRING_INDEX to separate the first ID*/
FROM mytable
GROUP BY SUBSTRING_INDEX(Name,' ',1), 
         SUBSTRING_INDEX(Address,' ',1) 
HAVING COUNT(*) > 1);

Now you'll have a table (mytable_to_update) that might probably have all non-duplicates. In case some of the data in the mytable_to_update are not what you want, you can just remove it or in case there are some data you think is not a duplicate, you can insert it. It's pretty much a manual process afterwards; well, even with the queries, only yourself can determine whether the processes/data are correct.

Here's a fiddle: https://www.db-fiddle.com/f/6Dfrn78mqZbGTwZs3U9Vhi/0

Upvotes: 1

Digvijay
Digvijay

Reputation: 8957

Since laravel was tagged initially, later removed, I thought the strategy can still help.

This is the given list:

$lists = [
    [
        'id' => 1,
        'text' => '2693 Edgewood Road Exit',
    ],
    [
        'id' => 2,
        'text' => '4408 Cost 4657 Avenue',
    ],
    [
        'id' => 3,
        'text' => '2693 Mapleview Road',
    ],
    [
        'id' => 4,
        'text' => '4657 Cost Edgewood Avenue',
    ],
    [
        'id' => 5,
        'text' => '4408 Mapleview Drive Road',
    ]
];

Goal is to find repetitive/duplicate texts from each.


Since finding duplication of ONE word is not a real scenario, I thought of finding the duplication with TWO words with all the combinations possible.

    $combinations = [];
    foreach ($lists as $list) {

        $insideCombo = [];
        $insideText = explode(' ', $list['text']);
        $length = count($insideText);

        for ($i = 0; $i < $length; $i++) {
            for ($j = $i + 1; $j < $length; $j++) {
                if (isset($insideText[$j])) {
                    $insideCombo[] = $insideText[$i] . ' ' . $insideText[$j];
                }
            }
        }

        $combinations[$list['id']] = $insideCombo;
    }

This is gonna return

// for '2693 Edgewood Road Exit'
1 => array:6 [
    0 => "2693 Edgewood"
    1 => "2693 Road"
    2 => "2693 Exit"
    3 => "Edgewood Road"
    4 => "Edgewood Exit"
    5 => "Road Exit"
]

Now, we loop again to compare the possible repetition. Here, we leverage Laravel's Str::containsAll()

$copyCat = [];
foreach ($lists as $list) {
    foreach ($combinations as $comboKey => $combination) {
        /* no need to compare the text with itself && 
        *  to avoid duplication of '4 to 2' if '2 to 4' is already mentioned
        */
        if ($list['id'] != $comboKey && $list['id'] < $comboKey) {
            foreach ($combination as $row) {
                if (Str::containsAll($list['text'], explode(' ', $row))) {
                    $copyCat[] = $list['id'] . ' matches with ' . $comboKey . ' with "' . $row . '"';
                }
            }
        }
    }
}

Final Response of $copyCat

array:5 [
  0 => "1 matches with 3 with [2693 Road]"
  1 => "2 matches with 4 with [4657 Cost]"
  2 => "2 matches with 4 with [4657 Avenue]"
  3 => "2 matches with 4 with [Cost Avenue]"
  4 => "3 matches with 5 with [Mapleview Road]"
]

Keep me posted in the comments below. Cheers!

Upvotes: 1

Related Questions