Reputation: 13
What's an elegant code for finding common alphabets among two strings without including the space in PHP?
Also return a similarity index i.e. count the number of common characters and return that as a percentage over the total number of characters.
Suppose i have one string "LEGENDARY ", whereas other as "BARNEY STINSON" so i need to find common letters b/w both without including space.
Also my code should return a similarity index i.e. count the number of common characters and return that as a percentage over the total number of characters.
For these two strings, the common characters are "ARNEY" so the score is 5/22 ~= 22%. Any help on this?
Upvotes: 1
Views: 3172
Reputation: 5617
I've found that to calculate a similarity percentage between strings, the Levenshtein and Jaro Winkler algorithms work well for spelling mistakes and small changes between strings, while the Smith Waterman Gotoh algorithm works well for strings where significant chunks of the text would be the same but with "noise" around the edges. This answer to a similar question shows more detail on this.
Here are php examples of using each of these three examples to return a similarity percentage between two strings:
echo levenshtein("LEGENDARY","BARNEY STINSON");
class StringCompareJaroWinkler
{
public function compare($str1, $str2)
{
return $this->JaroWinkler($str1, $str2, $PREFIXSCALE = 0.1 );
}
private function getCommonCharacters( $string1, $string2, $allowedDistance ){
$str1_len = mb_strlen($string1);
$str2_len = mb_strlen($string2);
$temp_string2 = $string2;
$commonCharacters='';
for( $i=0; $i < $str1_len; $i++){
$noMatch = True;
// compare if char does match inside given allowedDistance
// and if it does add it to commonCharacters
for( $j= max( 0, $i-$allowedDistance ); $noMatch && $j < min( $i + $allowedDistance + 1, $str2_len ); $j++){
if( $temp_string2[$j] == $string1[$i] ){
$noMatch = False;
$commonCharacters .= $string1[$i];
$temp_string2[$j] = '';
}
}
}
return $commonCharacters;
}
private function Jaro( $string1, $string2 ){
$str1_len = mb_strlen( $string1 );
$str2_len = mb_strlen( $string2 );
// theoretical distance
$distance = (int) floor(min( $str1_len, $str2_len ) / 2.0);
// get common characters
$commons1 = $this->getCommonCharacters( $string1, $string2, $distance );
$commons2 = $this->getCommonCharacters( $string2, $string1, $distance );
if( ($commons1_len = mb_strlen( $commons1 )) == 0) return 0;
if( ($commons2_len = mb_strlen( $commons2 )) == 0) return 0;
// calculate transpositions
$transpositions = 0;
$upperBound = min( $commons1_len, $commons2_len );
for( $i = 0; $i < $upperBound; $i++){
if( $commons1[$i] != $commons2[$i] ) $transpositions++;
}
$transpositions /= 2.0;
// return the Jaro distance
return ($commons1_len/($str1_len) + $commons2_len/($str2_len) + ($commons1_len - $transpositions)/($commons1_len)) / 3.0;
}
private function getPrefixLength( $string1, $string2, $MINPREFIXLENGTH = 4 ){
$n = min( array( $MINPREFIXLENGTH, mb_strlen($string1), mb_strlen($string2) ) );
for($i = 0; $i < $n; $i++){
if( $string1[$i] != $string2[$i] ){
// return index of first occurrence of different characters
return $i;
}
}
// first n characters are the same
return $n;
}
private function JaroWinkler($string1, $string2, $PREFIXSCALE = 0.1 ){
$JaroDistance = $this->Jaro( $string1, $string2 );
$prefixLength = $this->getPrefixLength( $string1, $string2 );
return $JaroDistance + $prefixLength * $PREFIXSCALE * (1.0 - $JaroDistance);
}
}
$jw = new StringCompareJaroWinkler();
echo $jw->compare("LEGENDARY","BARNEY STINSON");
class SmithWatermanGotoh
{
private $gapValue;
private $substitution;
/**
* Constructs a new Smith Waterman metric.
*
* @param gapValue
* a non-positive gap penalty
* @param substitution
* a substitution function
*/
public function __construct($gapValue=-0.5,
$substitution=null)
{
if($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
//if(empty($substitution)) throw new Exception("substitution is required");
if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
else $this->substitution = $substitution;
$this->gapValue = $gapValue;
}
public function compare($a, $b)
{
if (empty($a) && empty($b)) {
return 1.0;
}
if (empty($a) || empty($b)) {
return 0.0;
}
$maxDistance = min(mb_strlen($a), mb_strlen($b))
* max($this->substitution->max(), $this->gapValue);
return $this->smithWatermanGotoh($a, $b) / $maxDistance;
}
private function smithWatermanGotoh($s, $t)
{
$v0 = [];
$v1 = [];
$t_len = mb_strlen($t);
$max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));
for ($j = 1; $j < $t_len; $j++) {
$v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
$this->substitution->compare($s, 0, $t, $j));
$max = max($max, $v0[$j]);
}
// Find max
for ($i = 1; $i < mb_strlen($s); $i++) {
$v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));
$max = max($max, $v1[0]);
for ($j = 1; $j < $t_len; $j++) {
$v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
$v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));
$max = max($max, $v1[$j]);
}
for ($j = 0; $j < $t_len; $j++) {
$v0[$j] = $v1[$j];
}
}
return $max;
}
}
class SmithWatermanMatchMismatch
{
private $matchValue;
private $mismatchValue;
/**
* Constructs a new match-mismatch substitution function. When two
* characters are equal a score of <code>matchValue</code> is assigned. In
* case of a mismatch a score of <code>mismatchValue</code>. The
* <code>matchValue</code> must be strictly greater then
* <code>mismatchValue</code>
*
* @param matchValue
* value when characters are equal
* @param mismatchValue
* value when characters are not equal
*/
public function __construct($matchValue, $mismatchValue) {
if($matchValue <= $mismatchValue) throw new Exception("matchValue must be > matchValue");
$this->matchValue = $matchValue;
$this->mismatchValue = $mismatchValue;
}
public function compare($a, $aIndex, $b, $bIndex) {
return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue
: $this->mismatchValue);
}
public function max() {
return $this->matchValue;
}
public function min() {
return $this->mismatchValue;
}
}
$o = new SmithWatermanGotoh();
echo $o->compare("LEGENDARY","BARNEY STINSON");
Upvotes: 10
Reputation: 10537
see similar_text()
. And if you want to exclude spaces simple str_replace(' ', '', $string);
prior.
echo similar_text ( 'LEGENDARY' , 'BARNEYSTINSON', $percent); // outputs 3
echo $percent; // outputs 27.272727272727
Here's another way using only unique characters
<?php
function unique_chars($string) {
return count_chars(strtolower(str_replace(' ', '', $string)), 3);
}
function compare_strings($a, $b) {
$index = similar_text(unique_chars($a), unique_chars($b), $percent);
return array('index' => $index, 'percent' => $percent);
}
print_r( compare_strings('LEGENDARY', 'BARNEY STINSON') );
// outputs:
?>
Array
(
[index] => 5
[percent] => 55.555555555556
)
Upvotes: 1