Reputation: 24122
I have an interface where a user will enter the name of a company. It then compares what they typed to current entries in the database, and if something similar is found it presents them with options (in case they misspelled) or they can click a button which confirms what they typed is definitely new and unique.
The problem I am having is that it is not very accurate and often brings up dozens of "similar" matches that aren't that similar at all!
Here is what I have now, the first large function I didn't make and I am not clear on what exactly it does. Is there are much simpler way to acheive what I want?
// Compares strings and determines how similar they are based on a nth letter split comparison.
function cmp_by_optionNumber($b, $a) {
if ($a["score"] == $b["score"]) return 0;
if ($a["score"] > $b["score"]) return 1;
return -1;
}
function string_compare($str_a, $str_b)
{
$length = strlen($str_a);
$length_b = strlen($str_b);
$i = 0;
$segmentcount = 0;
$segmentsinfo = array();
$segment = '';
while ($i < $length)
{
$char = substr($str_a, $i, 1);
if (strpos($str_b, $char) !== FALSE)
{
$segment = $segment.$char;
if (strpos($str_b, $segment) !== FALSE)
{
$segmentpos_a = $i - strlen($segment) + 1;
$segmentpos_b = strpos($str_b, $segment);
$positiondiff = abs($segmentpos_a - $segmentpos_b);
$posfactor = ($length - $positiondiff) / $length_b; // <-- ?
$lengthfactor = strlen($segment)/$length;
$segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
}
else
{
$segment = '';
$i--;
$segmentcount++;
}
}
else
{
$segment = '';
$segmentcount++;
}
$i++;
}
// PHP 5.3 lambda in array_map
$totalscore = array_sum(array_map(function($v) { return $v['score']; }, $segmentsinfo));
return $totalscore;
}
$q = $_POST['stringA'] ;
$qLengthMin = strlen($q) - 5 ; // Part of search calibration. Smaller number = stricter.
$qLengthMax = strlen($q) + 2 ; // not in use.
$main = array() ;
include("pdoconnect.php") ;
$result = $dbh->query("SELECT id, name FROM entity_details WHERE
name LIKE '{$q[0]}%'
AND CHAR_LENGTH(name) >= '$qLengthMin'
#LIMIT 50") ; // The first letter MUST be correct. This assumption makes checker faster and reduces irrelivant results.
$x = 0 ;
while($row = $result->fetch(PDO::FETCH_ASSOC)) {
$percent = string_compare(strtolower($q), strtolower(rawurldecode($row['name']))) ;
if($percent == 1) {
//echo 1 ;// 1 signifies an exact match on a company already in our DB.
echo $row['id'] ;
exit() ;
}
elseif($percent >= 0.6) { // Part of search calibration. Higher deci number = stricter.
$x++ ;
$main[$x]['name'] = rawurldecode($row['name']) ;
$main[$x]['score'] = round($percent, 2) * 100;
//array_push($overs, urldecode($row['name']) . " ($percent)<br />") ;
}
}
usort($main, "cmp_by_optionNumber") ;
$z = 0 ;
echo '<div style="overflow-y:scroll;height:175px;width:460px;">' ;
foreach($main as $c) {
if($c['score'] > 100) $c['score'] = 100 ;
if(count($main) > 1) {
echo '<div id="anysuggested' . $z . '" class="hoverdiv" onclick="selectAuto(' . "'score$z'" . ');">' ;
}
else echo '<div id="anysuggested' . $z . '" class="hoverdiv" style="color:#009444;" onclick="selectAuto(' . "'score$z'" . ');">' ;
echo '<span id="autoscore' . $z . '">' . $c['name'] . '</span></div>' ;
$z++ ;
}
echo '</div>' ;
Upvotes: 0
Views: 1284
Reputation: 3288
You need aproximate/fuzzy string matching.
Read more about http://php.net/manual/en/function.levenshtein.php, http://www.slideshare.net/kyleburton/fuzzy-string-matching
The best way would be to use some index based search engine like SOLR http://lucene.apache.org/solr/.
Upvotes: 0
Reputation: 6106
Comparing strings is a huge topic and there are many ways to do it. One very common algorithm is called the Levenshtein difference. This is a native implementation in PHP but none in MySQL. There is however an implementation here that you could use.
Upvotes: 1