Reputation: 179
I have an addresses table in my MYSQL database with the following structure:
I wish to display all the records and highlight the possible duplicates, based on the address/address slug.
In this case, the duplicates are as follows:
Is there a way to partially match a string in MYSQL or PHP, to achieve the above results?
FYI: I have gone through SPHINX PHP, SQL FULLTEXT SEARCHES etc.
I have been struggling over 2 weeks, but couldn't find any optimal solution.
Any ideas, suggestions, solutions are welcome.
Upvotes: 2
Views: 415
Reputation: 7114
mytable_to_update
.SELECT SUBSTRING_INDEX(Name,' ',1),COUNT(*)
FROM mytable_to_update
GROUP BY SUBSTRING_INDEX(Name,' ',1) HAVING COUNT(*) = 1;
The SUBSTRING_INDEX will capture the first string before space (' '). In the example, Sam Mcarthy
will become Sam
only. Then using that to group and count how many name occurrences it has. HAVING COUNT(*) = 1
will only show any name occurring once. But that might as well return nothing if there's a name like Joe
and Joe John
but the two are actually a different person with different addresses (since the first query only group by the first name occurring). Therefore, we need to add address
comparison in the mix.
Address
column like this:SELECT SUBSTRING_INDEX(Name,' ',1),
SUBSTRING_INDEX(Address,' ',1), /*we take the first string in the address*/
COUNT(*)
FROM mytable_to_update
GROUP BY SUBSTRING_INDEX(Name,' ',1),
SUBSTRING_INDEX(Address,' ',1) /*then add group by for the address*/
HAVING COUNT(*) = 1;
Similarly, we take only the first string occurrence from the address. So let's say for example there are two data that looks like this, Joe, 12 Street..
and Joe John, 12 St. ..
, what will happen is the query above will (given the SUBSTRING_INDEX
function) take only the first string occurrence; Joe, 12
, which will return the count value as 2. That means both data (Joe, 12 Street..
and Joe John, 12 St. ..
) are considered as duplicates and will not show in the query results.
ID
to be inserted into mytable_to_update
table:INSERT INTO mytable_to_update
SELECT * FROM mytable WHERE ID IN
(SELECT GROUP_CONCAT(ID) /*replace everything else in the select with just `ID`*/
FROM mytable
GROUP BY SUBSTRING_INDEX(Name,' ',1),
SUBSTRING_INDEX(Address,' ',1)
HAVING COUNT(*) = 1) ;
Note: I'm using GROUP_CONCAT(ID) because of incompatibility of sql_mode=only_full_group_by - if it's being set. Of course the result could be different (like '1,2' or '1,,,,,') but since we're only looking at any count=1, it shouldn't have a problem as it will only return 1 value. I've tested with ANY_VALUE it also return similar results.
Now you have all the non-duplicates inside the mytable_to_update
table. the next step is to search for duplicates and insert the ones that you only want. This is merely a suggestion/assumption of what you might want and it's not 100% accurate due to the nature of the data value that we're comparing.
SELECT GROUP_CONCAT(ID), /*add GROUP_CONCAT to list all the duplicates group by the first name & address string.*/
Name,
Address,
COUNT(*)
FROM mytable
GROUP BY SUBSTRING_INDEX(Name,' ',1),
SUBSTRING_INDEX(Address,' ',1)
HAVING COUNT(*) > 1; /*Change '= 1' to '> 1' to get any records with more than 1 count.*/
Using GROUP_CONCAT to generate a comma separated list of ID
that has possible duplicates.
GROUP_CONCAT
over all the columns listed with identical ORDER BY
so every columns will be ordering by the same thing.SELECT GROUP_CONCAT(ID ORDER BY ID), /*add ORDER BY*/
GROUP_CONCAT(Name ORDER BY ID),
GROUP_CONCAT(Address ORDER BY ID),
COUNT(*)
FROM mytable
GROUP BY SUBSTRING_INDEX(Name,' ',1),
SUBSTRING_INDEX(Address,' ',1)
HAVING COUNT(*) > 1;
With this you go over the values it returned for any of the duplicates and compare it side by side. That way you can decide to omit any ID that you don't want to appear in the list by adding WHERE ID NOT IN(1,3 ...)
etc.
ID
you want to keep, you can do something like this:INSERT INTO mytable_to_update
SELECT * FROM mytable WHERE ID IN
(SELECT SUBSTRING_INDEX(GROUP_CONCAT(ID ORDER BY ID),',',1)
/*assuming that you only want the first ID in the set, do SUBSTRING_INDEX to separate the first ID*/
FROM mytable
GROUP BY SUBSTRING_INDEX(Name,' ',1),
SUBSTRING_INDEX(Address,' ',1)
HAVING COUNT(*) > 1);
Now you'll have a table (mytable_to_update
) that might probably have all non-duplicates. In case some of the data in the mytable_to_update
are not what you want, you can just remove it or in case there are some data you think is not a duplicate, you can insert it. It's pretty much a manual process afterwards; well, even with the queries, only yourself can determine whether the processes/data are correct.
Here's a fiddle: https://www.db-fiddle.com/f/6Dfrn78mqZbGTwZs3U9Vhi/0
Upvotes: 1
Reputation: 8957
Since laravel
was tagged initially, later removed, I thought the strategy can still help.
This is the given list:
$lists = [
[
'id' => 1,
'text' => '2693 Edgewood Road Exit',
],
[
'id' => 2,
'text' => '4408 Cost 4657 Avenue',
],
[
'id' => 3,
'text' => '2693 Mapleview Road',
],
[
'id' => 4,
'text' => '4657 Cost Edgewood Avenue',
],
[
'id' => 5,
'text' => '4408 Mapleview Drive Road',
]
];
Goal is to find repetitive/duplicate texts from each.
Since finding duplication of ONE word is not a real scenario, I thought of finding the duplication with TWO words with all the combinations possible.
$combinations = [];
foreach ($lists as $list) {
$insideCombo = [];
$insideText = explode(' ', $list['text']);
$length = count($insideText);
for ($i = 0; $i < $length; $i++) {
for ($j = $i + 1; $j < $length; $j++) {
if (isset($insideText[$j])) {
$insideCombo[] = $insideText[$i] . ' ' . $insideText[$j];
}
}
}
$combinations[$list['id']] = $insideCombo;
}
This is gonna return
// for '2693 Edgewood Road Exit'
1 => array:6 [
0 => "2693 Edgewood"
1 => "2693 Road"
2 => "2693 Exit"
3 => "Edgewood Road"
4 => "Edgewood Exit"
5 => "Road Exit"
]
Now, we loop again to compare the possible repetition. Here, we leverage Laravel's Str::containsAll()
$copyCat = [];
foreach ($lists as $list) {
foreach ($combinations as $comboKey => $combination) {
/* no need to compare the text with itself &&
* to avoid duplication of '4 to 2' if '2 to 4' is already mentioned
*/
if ($list['id'] != $comboKey && $list['id'] < $comboKey) {
foreach ($combination as $row) {
if (Str::containsAll($list['text'], explode(' ', $row))) {
$copyCat[] = $list['id'] . ' matches with ' . $comboKey . ' with "' . $row . '"';
}
}
}
}
}
Final Response of $copyCat
array:5 [
0 => "1 matches with 3 with [2693 Road]"
1 => "2 matches with 4 with [4657 Cost]"
2 => "2 matches with 4 with [4657 Avenue]"
3 => "2 matches with 4 with [Cost Avenue]"
4 => "3 matches with 5 with [Mapleview Road]"
]
Keep me posted in the comments below. Cheers!
Upvotes: 1