user188995

Reputation: 557

Insert HTML formatted string into another string

I have two strings. One of them contains <em> tags, is entirely lowercase, and contains no delimiters or common words such as 'the' or 'in'; the other has none of these restrictions. An example:

$str1 = 'world <em>round</em>';
$str2 = 'World - is Round';

I want to turn $str2 into 'World - is <em>Round</em>' by checking which lowercase word in $str1 is wrapped in an <em> tag. So far I've done the following, but it fails if the number of words differs between the two strings.

public static function applyHighlighingOnDisplayName($str1, $str2) {
    $str1_w = explode(' ', $str1);
    $str2_w = explode(' ', $str2);
    for ($i=0; $i<count($str1_w); $i++) {
       if (strpos($str1_w[$i], '<em>') !== false) {
            $str2_w[$i] = '<em>' . $str2_w[$i] . '</em>';
       }
    }
    return implode(' ', $str2_w);
}

$str1 = '<em>cup</em> <em>cakes</em>' & $str2 = 'Cup Cakes':

applyHighlighingOnDisplayName($str1, $str2) : '<em>Cup</em> <em>Cakes</em>': Correct

$str1 = 'cup <em>cakes</em>' & $str2 = 'The Cup Cakes':

applyHighlighingOnDisplayName($str1, $str2) : 'The <em>Cup</em> Cakes': Incorrect

How should I change my approach?

Upvotes: 0

Views: 80

Answers (3)

i alarmed alien

Reputation: 9520

Your current method is dependent on the number of words in the strings; a better solution would be to use regular expressions to do the matching for you. The following version will work safely even if you have emphasized words that are substrings of other emphasized words (e.g. "cat" and "cat's cradle" or "cat-litter").

function applyHighlighingOnDisplayName($str1, $str2) {

    # if we have strings surrounded by <em> tags...
    if (preg_match_all("#<em>(.+?)</em>#", $str1, $match)) {

        # sort the match strings by length, descending
        usort($match[1], function($a,$b){ return strlen($b) - strlen($a); } );

        # all the match words are in $match[1]
        foreach ($match[1] as $m) {
            # escape regex metacharacters (and the # delimiter) in the match word
            $word = preg_quote($m, '#');
            # replace every match with a string that is very unlikely to occur
            # this prevents \b matching the start or end of <em> and </em>
            $str2 = preg_replace("#\b($word)\b#i",
                "ZZZZ$1ZZZZ",
                $str2);
        }
        # replace ZZZZ with the <em> tags
        return preg_replace("#ZZZZ(.*?)ZZZZ#", "<em>$1</em>", $str2);
    }
    return $str2;
}

$str1 = 'cup <em>cakes</em>';
$str2 = 'Cup Cakes';

print applyHighlighingOnDisplayName($str1, $str2) . PHP_EOL;
print applyHighlighingOnDisplayName($str1, 'The Cup Cakes') . PHP_EOL;

Output:

Cup <em>Cakes</em>
The Cup <em>Cakes</em>

Two strings with no <em>'d words:

$str1 = 'cup cakes';
$str2 = 'Cup Cakes';

print applyHighlighingOnDisplayName($str1, $str2) . PHP_EOL;

Output:

Cup Cakes

Now something rather trickier: lots of short words, where one word is a substring of all the others:

$str1 = '<em>i</em> <em>if</em> <em>in</em> <em>i\'ve</em> <em>is</em> <em>it</em>';

$str2 = 'I want to make the str2 as "World - is Round", by comparing which lowercase word in the str1 contains the em tag. So far, I\'ve done the following, but it fails if number of words aren\'t equal in both strings.';

print applyHighlighingOnDisplayName($str1, $str2) . PHP_EOL;

Output:

<em>I</em> want to make the str2 as "World - <em>is</em> Round", by comparing which lowercase word <em>in</em> the str1 contains the em tag. So far, <em>I've</em> done the following, but <em>it</em> fails <em>if</em> number of words aren't equal <em>in</em> both strings.
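
The ZZZZ placeholder is needed because each word is replaced in its own pass, so a later pass could otherwise match inside tags inserted by an earlier one. As an alternative sketch (not part of the original answer), all emphasized words can be combined into one alternation and replaced in a single pass, which removes the need for a placeholder. The function name highlightSinglePass is hypothetical, and the sketch assumes each <em>...</em> pair in $marked wraps plain text:

```php
<?php
// Hypothetical single-pass variant; the name is illustrative only.
function highlightSinglePass(string $marked, string $display): string
{
    // Collect the text inside every <em>...</em> pair.
    if (!preg_match_all('#<em>(.+?)</em>#', $marked, $found)) {
        return $display; // nothing to highlight
    }

    $words = array_unique($found[1]);

    // Longest first, so "i've" is tried before "i" within the alternation.
    usort($words, function ($a, $b) { return strlen($b) - strlen($a); });

    // Escape regex metacharacters and the # delimiter in each word.
    $alternatives = array_map(function ($w) { return preg_quote($w, '#'); }, $words);

    // One pattern, one pass: text that has been wrapped is never re-scanned.
    $pattern = '#\b(?:' . implode('|', $alternatives) . ')\b#i';

    return preg_replace($pattern, '<em>$0</em>', $display);
}

echo highlightSinglePass('cup <em>cakes</em>', 'The Cup Cakes'), PHP_EOL;
```

The ordering inside the alternation matters for the same reason the original sorts by length: with "i" listed before "i've", the shorter word would win at the start of "I've".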

Upvotes: 1

motanelu

Reputation: 4025

As others have said, a regex is the solution. Here is a working example with detailed comments:

$string1 = 'world <em>round</em>';
$string2 = 'World is - Round';

// extract what's in between <em> and </em> - it will be stored in $matches[1]
preg_match('/<em>(.+)<\/em>/i', $string1, $matches);

if (!$matches) {
    echo 'The first string does not contain <em>';
    exit();
}

// replace what we found in the previous operation
$newString = preg_replace('/\b' . preg_quote($matches[1], '/') . '\b/i', '<em>$0</em>', $string2);
echo $newString;

Details at:

Later edit - cover multiple cases:

$string1 = 'world <em>round</em> not <em>flat</em>';
$string2 = 'World is - Round not Flat! Round, ok?';

// extract what's in between <em> and </em> - it will be stored in $matches[1]
preg_match_all('/<em>(.+?)<\/em>/i', $string1, $matches);

// preg_match_all always fills $matches with sub-arrays, so test the capture group itself
if (empty($matches[1])) {
    echo 'The first string does not contain <em>';
    exit();
}

foreach ($matches[1] as $match) {
    // replace what we found in the previous operation
    $string2 = preg_replace('/\b' . preg_quote($match, '/') . '\b/i', '<em>$0</em>', $string2);
}

echo $string2;

Upvotes: 1

Marc B

Reputation: 360692

It's because your highlighting code is expecting a 1:1 correspondence between word positions in the two strings:

cup <em>cakes</em>
 1        2
Cup     Cakes

but on your incorrect sample:

cup <em>cakes</em>
 1        2            3
The      Cup         Cakes

That is, you find <em> at word #2, so you highlight word #2 in the other string, but in that string word #2 is Cup.

A better algorithm would be to strip the html from your original string, so you end up with just cup cakes. Then you look for cup cakes in the other string, and highlight the second word of that location. That'll compensate for any "motion" within the string caused by extra (or fewer) words.
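
A minimal sketch of that idea, under some assumptions that are not in the answer itself: words are separated by single spaces, each <em> pair wraps exactly one word, and the tag-free phrase appears contiguously in the other string. The function name highlightByPhrasePosition is hypothetical.

```php
<?php
// Hypothetical helper illustrating the "strip, locate, highlight" idea.
function highlightByPhrasePosition(string $marked, string $display): string
{
    // Strip the HTML so only the plain phrase remains, e.g. "cup cakes".
    $plain = strip_tags($marked);
    if ($plain === '') {
        return $display;
    }

    // Locate the phrase in the display string, ignoring case.
    $start = stripos($display, $plain);
    if ($start === false) {
        return $display; // phrase does not occur contiguously: leave unchanged
    }

    // Word i of the marked string lines up with word i of the matched segment.
    $markedWords  = explode(' ', $marked);
    $segmentWords = explode(' ', substr($display, $start, strlen($plain)));

    foreach ($markedWords as $i => $word) {
        if (strpos($word, '<em>') !== false) {
            $segmentWords[$i] = '<em>' . $segmentWords[$i] . '</em>';
        }
    }

    // Reassemble: text before the phrase, highlighted phrase, text after it.
    return substr($display, 0, $start)
        . implode(' ', $segmentWords)
        . substr($display, $start + strlen($plain));
}

echo highlightByPhrasePosition('cup <em>cakes</em>', 'The Cup Cakes'), PHP_EOL;
```

Note the limitation: for 'world <em>round</em>' and 'World - is Round', the phrase "world round" never appears contiguously, so this sketch returns the string unchanged. Per-word matching, as in the regex answers, handles that case.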

Upvotes: 0
