Reputation: 193
I have a mysql table with a column (code) that contains 6 character strings (varchar(6)). An example record for that column would be:
ID code
1 ADHNQS
2 BDHLQS
3 AEGMQS
What I need to do is compare 1 code record at a time to the rest and retrieve "related codes". A related code will qualify if it contains at 4 or more of the same characters. Example:
ADHNQS -> BDHLQS would be a match since D,H,Q,S are shared and are 4 or greater.
ADHNQS -> BCHLQR would NOT be a match since their shared characters are less than 4.
How can I retrieve a set of all records in SQL with their character match count WHERE they are 4 or greater?I have looked into many of the MYSQL string functions but have not found anything that is jumping out to me as a simple solution. Thank you in advance for your help!
Upvotes: 1
Views: 2238
Reputation: 7847
This looked interesting so I gave it a shot. First I built a numbers table that literally just has numbers 1-15 in it. Then using that table I split the string along with its index.
Then I join that to itself and look for matches of 4 or more. In theory this should work for any string length as long as your numbers table is big enough.
select a2.code
from
(
select *,
SUBSTRING(t.code, n.num, 1) as 'Character'
from numbers n
join test t
on length(t.code) >= n.num
) a1
join
(
select *,
SUBSTRING(t.code, n.num, 1) as 'Character'
from numbers n
join test t
on length(t.code) >= n.num
) a2 on a1.character = a2.character and a1.id <> a2.id
where a1.id = 1
group by a2.code having count(1) >= 4
Here's the SQL Fiddle Demo
Upvotes: 0
Reputation: 108410
In the example "match", the characters that match are in the same positions in both strings. It's not clear if this is the actual specification, or if that's just an anomaly in the example. Also, we note that in the example data, the list of characters is distinct, there aren't two of the same character in any string. Again, not sure if that's part of the specification, or an anomaly in the example.
Also, are the code values always six characters in length? Any special handling for shorter strings, or space characters? Etc.
In the simplest case, where we're comparing the strings position by position, and the only requirement is that a character be equal to another character (no special handling for spaces, or non-alphabetic, etc.) then something like this would return the specified result:
SELECT c.id
, c.code
, d.id
, d.code
FROM mytable c
JOIN mytable d
ON d.id <> c.id
AND ( IFNULL( NULLIF(SUBSTR(c.code,1,1),'') = NULLIF(SUBSTR(d.code,1,1),'') ,0)
+ IFNULL( NULLIF(SUBSTR(c.code,2,1),'') = NULLIF(SUBSTR(d.code,2,1),'') ,0)
+ IFNULL( NULLIF(SUBSTR(c.code,3,1),'') = NULLIF(SUBSTR(d.code,3,1),'') ,0)
+ IFNULL( NULLIF(SUBSTR(c.code,4,1),'') = NULLIF(SUBSTR(d.code,4,1),'') ,0)
+ IFNULL( NULLIF(SUBSTR(c.code,5,1),'') = NULLIF(SUBSTR(d.code,5,1),'') ,0)
+ IFNULL( NULLIF(SUBSTR(c.code,6,1),'') = NULLIF(SUBSTR(d.code,6,1),'') ,0)
) >= 4
WHERE c.id = 1
ORDER BY c.id, d.id
If we need to compare each character in code
to each of the characters in the other code
, we'd have something similar, we'd just need to perform a total of 36 comparisons. (compare pos 1 to pos
1,2,3,4,5,6, compare pos 2 to pos 1,2,3,4,5,6)
That could be done exactly the same as the query above, except that the AND ( ) >= 4
predicate would contain a total of 36 comparison tests, rather than just six.
That brings up (again) the issue of multiples of the same character in the string, and how those should be "counted" towards a match. For example, consider:
code1: QbQdef
code2: QxyQQz
The Q in position 1 of code1 matches three Q's in code2, and the Q in position 2 of code1 also matches three Q's in code2... for a total match count of 6. Do we want to consider these two codes as matching? If not, we could modify the test block in the query a bit, so that a match of the character in position 1 to any character in code2 would result in only 1 being added to the match count.
The specifications need to be fleshed out more, in order to make a determination of the actual SQL statement that achieves the desired result.
Upvotes: 1
Reputation: 2698
If you just have the id of your reference row (here: 1):
SELECT b.ID, b.code,
(if(substring(b.code,1,1)=substr(a.code,1,1),1,0) + if(substring(b.code,2,1)=substr(a.code,2,1),1,0) + if(substring(b.code,3,1)=substr(a.code,3,1),1,0) + if(substring(b.code,4,1)=substr(a.code,4,1),1,0) + if(substring(b.code,5,1)=substr(a.code,5,1),1,0) + if(substring(b.code,6,1)=substr(a.code,6,1),1,0)) as matchcount
FROM yourtablename as a, yourtablename as b
WHERE a.ID=1
AND b.ID<>a.ID
GROUP BY 1
HAVING matchcount>=4
ORDER BY matchcount desc
Returns:
ID code matchcount
2 BDHLQS 4
If you just have the code (here: ADHNQS), then you can build your query manually like this (will return your exact code, too, if it exists):
SELECT ID, code,
(if(substring(code,1,1)="A",1,0) + if(substring(code,2,1)="D",1,0) + if(substring(code,3,1)="H",1,0) + if(substring(code,4,1)="N",1,0) + if(substring(code,5,1)="Q",1,0) + if(substring(code,6,1)="S",1,0)) as matchcount
FROM yourtablename
GROUP BY 1
HAVING matchcount>=4
ORDER BY matchcount desc
Returns:
ID code matchcount
1 ADHNQS 6
2 BDHLQS 4
Upvotes: 2
Reputation: 2686
@spencer7593 has a really long nice sql statement that works when the varchar positions are the same.
but if you want to solve this problem with php and the char positions are not always the same:
$string1 = 'SOMESTRING';
$stringAsArray = str_split($string1);
sort($stringAsArray);
$string2 = "ASDFOMKHRG";
$string2AsArray = str_split($string2);
$count = 0;
foreach($stringAsArray as $value){
foreach($string2AsArray as $value2){
if($value == $value2) count+=1;
}
}
if(count >= 4) return string2;
NOTE: One problem with this though is if the strings have duplicate chars. There would have to be some extra logic. Not shown by the question so I didn't add it.
Upvotes: 1