Reputation: 9
I have a script that compares two texts and highlights the words that differ, but it does not work well. Many words are marked as different when they are not. For example, common words such as "that" and "the" are not recognized as unchanged, and if they sit between two words that really have changed, they are marked as changed too. I attach an image.
<?php
$old = 'The one-page order, which Mr. Trump signed in a hastily arranged Oval Office ceremony shortly before departing for the inaugural balls, gave no specifics about which aspects of the law it was targeting. But its broad language gave federal agencies wide latitude to change, delay or waive provisions of the law that they deemed overly costly for insurers, drug makers, doctors, patients or states, suggesting that it could have wide-ranging impact, and essentially allowing the dismantling of the law to begin even before Congress moves to repeal it.';
$new = 'The one-page order, which Mr. Trump signed in a unexpectedly organized Oval workplace rite quickly before departing for the inaugural balls, gave no specifics approximately which components of the law it became targeting. But its large language gave federal organizations huge range to exchange, put off or waive provisions of the law that they deemed overly luxurious for insurers, drug makers, docs, sufferers or states, suggesting that it could have wide-ranging effect, and basically permitting the dismantling of the regulation to start even before Congress moves to repeal it.';
$oldArr = preg_split('/\s+/', $old);// old (initial) text split into words
$newArr = preg_split('/\s+/', $new);// new text split into words
$resArr = array();
$oldCount = count($oldArr)-1;
$newCount = count($newArr)-1;
$tmpOld = 0;// marker position for old (initial) string
$tmpNew = 0;// marker position for new (modified) string
$end = 0;// do while variable
// loop until $end is set to 1
while($end == 0){
// if the marker position is less than or equal to the last index of the initial text
// to make sure we don't overshoot the max length
if($tmpOld <= $oldCount){
// we check if the current words from both strings match, at the current marker positions
if($oldArr[$tmpOld] === $newArr[$tmpNew]){
// if they match, nothing has been modified, we push the word into results and increment both markers
array_push($resArr,$oldArr[$tmpOld]);
$tmpOld++;
$tmpNew++;
}else{
// if the words don't match, we need to check for recurrence of the searched word in the entire new string
$foundKey = array_search($oldArr[$tmpOld],$newArr,TRUE);
// if we find it
if($foundKey != '' && $foundKey > $tmpNew){
// we get all the words from the new string between the current marker and the foundKey exclusive
// and we place them into results, marking them as new words
for($p=$tmpNew;$p<$foundKey;$p++){
array_push($resArr,'<span class="new-word">'.$newArr[$p].'</span>');
}
// after that, we insert the found word as unmodified
array_push($resArr,$oldArr[$tmpOld]);
// and we increment old marker position by 1
$tmpOld++;
// and set the new marker position at the found key position, plus one
$tmpNew = $foundKey+1;
}else{
// if the word wasn't found it means it has been deleted
// and we need to add it to results, marked as deleted
array_push($resArr,'<span class="old-word">'.$oldArr[$tmpOld].'</span>');
// and increment the old marker by one
$tmpOld++;
}
}
}else{
$end = 1;
}
}
$textFinal = '';
foreach($resArr as $val){
$textFinal .= $val.' ';
}
echo "<p>".$textFinal."</p>";
?>
<style>
body {
background-color: #2A2A2A;
}
@font-face {
font-family: 'Eras Light ITC';
font-style: normal;
font-weight: normal;
src: local('Eras Light ITC'), url('ERASLGHT.woff') format('woff');
}
p {
font-family: 'Eras Light ITC', Arial;
color:white;
}
.new-word{background:rgba(1, 255, 133, 0.9);color:black;font-weight: bold;}
.new-word:after{background:rgba(1, 255, 133, 0.9)}
.old-word{text-decoration:none; position:relative;background:rgba(215, 40, 40, 0.9);}
.old-word:after{
}
</style>
Example:
Why does it mark those words as different if they have not changed? Regards!
Upvotes: 0
Views: 164
Reputation: 1493
I inspected your code, tried different cases and I think your algorithm is wrong.
For example, if you replace "for" or "the" with "one-page", you will see that it is also reported as a mismatch. The reason is that, when there is a mismatch, you search for the mismatched word in the whole array. If that word already occurs earlier (at a lower index than the current position), your algorithm fails.
To see this, you can use the following variables.
$old = 'for costly for insurers.';
$new = 'for luxurious for insurers.';
For this setup, when the costly/luxurious mismatch is found, your code tries to match the "for" that follows it. But the array_search call you are using returns the position of the first "for", at the beginning of the string.
$foundKey = array_search($oldArr[$tmpOld],$newArr,TRUE);
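To make the failure concrete, here is a minimal sketch that runs the same array_search call on the shortened $new string. The value of $tmpNew is an assumption: it is where the marker should be once the first "for" has been matched.

```php
<?php
$new    = 'for luxurious for insurers.';
$newArr = preg_split('/\s+/', $new);

// Assumed marker position: index 0 ("for") has already been matched.
$tmpNew = 1;

// array_search() always scans from the start of the array,
// so it reports the first "for" rather than the one at index 2.
$foundKey = array_search('for', $newArr, true);
var_dump($foundKey);           // int(0)

// The question's test "$foundKey > $tmpNew" therefore fails,
// and the second "for" ends up marked as deleted.
var_dump($foundKey > $tmpNew); // bool(false)
```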
So you should revise this section to search in a different way. You could write your own version of array_search that accepts a starting index. (Or you could unset the successfully matched elements from the array.)
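One possible shape for such a search is sketched below. The name array_search_from is hypothetical (it is not a PHP built-in), and the sketch assumes the array has consecutive integer keys, as preg_split returns.

```php
<?php
// Hypothetical helper, not part of PHP: a strict search that
// ignores every index below $start. Returns the index or false.
function array_search_from($needle, array $haystack, int $start)
{
    $count = count($haystack);
    for ($i = $start; $i < $count; $i++) {
        if ($haystack[$i] === $needle) {
            return $i;
        }
    }
    return false;
}

$newArr = preg_split('/\s+/', 'for luxurious for insurers.');

// Index 0 has already been matched, so the search starts at 1.
$tmpNew   = 1;
$foundKey = array_search_from('for', $newArr, $tmpNew);
var_dump($foundKey); // int(2)
```

In the loop from the question, the call would become array_search_from($oldArr[$tmpOld], $newArr, $tmpNew), and the result is best tested with $foundKey !== false, since a strict comparison keeps a valid index apart from "not found".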
Upvotes: 0