Firas Dib

Reputation: 2621

Highlight match result in subject string from preg_match_all()

I am trying to highlight the subject string with the returned $matches array from preg_match_all(). Let me start off with an example:

preg_match_all("/(.)/", "abc", $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

This will return:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => a
                    [1] => 0
                )

            [1] => Array
                (
                    [0] => a
                    [1] => 0
                )

        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] => b
                    [1] => 1
                )

            [1] => Array
                (
                    [0] => b
                    [1] => 1
                )

        )

    [2] => Array
        (
            [0] => Array
                (
                    [0] => c
                    [1] => 2
                )

            [1] => Array
                (
                    [0] => c
                    [1] => 2
                )

        )

)

What I want to do in this case is to highlight the overall consumed data AND each backreference.

Output should look like this:

<span class="match0">
    <span class="match1">a</span>
</span>
<span class="match0">
    <span class="match1">b</span>
</span>
<span class="match0">
    <span class="match1">c</span>
</span>

Another example:

preg_match_all("/(abc)/", "abc", $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

Should return:

<span class="match0"><span class="match1">abc</span></span>

I hope this is clear enough.

I want to highlight overall consumed data AND highlight each backreference.

Thanks in advance. If anything is unclear, please ask.

Note: It must not break HTML. Both the regex and the input string are completely dynamic and not known to the code in advance, so the subject string can be HTML and the matched data can contain HTML-like text.
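To make the "must not break HTML" requirement concrete, here is a minimal sketch for a single match range. It assumes the subject is to be displayed as text, so everything except the inserted tags gets escaped. The helper name wrap_escaped is hypothetical and not part of the question.

```php
<?php
// Hypothetical helper: wrap one byte range of $text in a span, escaping the
// text before, inside and after the range separately so that only the
// inserted tags remain live markup.
function wrap_escaped(string $text, int $start, int $length, string $class): string
{
    $before = substr($text, 0, $start);
    $match  = substr($text, $start, $length);
    $after  = substr($text, $start + $length);

    return htmlspecialchars($before)
        . '<span class="' . htmlspecialchars($class) . '">'
        . htmlspecialchars($match)
        . '</span>'
        . htmlspecialchars($after);
}

// Prints: a <span class="match0">&lt;b&gt;</span> c
echo wrap_escaped('a <b> c', 2, 3, 'match0'), "\n";
```

Escaping each piece separately matters because the offsets from PREG_OFFSET_CAPTURE refer to the unescaped string; escaping the whole subject first would shift them.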

Upvotes: 4

Views: 1910

Answers (4)

Firas Dib

Reputation: 2621

I am not too familiar with posting on Stack Overflow, so I hope I don't mess this up. I do this in almost the same way as @IMSoP, with a slight difference:

I store the tags like this:

$tags[ $matched_pos ]['open'][$backref_nr] = "open tag";
$tags[ $matched_pos + $len ]['close'][$backref_nr] = "close tag";

As you can see, almost identical to @IMSoP.
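A minimal sketch of how those two assignments might be filled from the preg_match_all() result. The function name build_tags and the exact span markup are assumptions for illustration, not taken from the original code.

```php
<?php
// Hypothetical helper: fill the $tags structure described above.
// The span markup is an assumption made for this example.
function build_tags(string $regex, string $text): array
{
    $tags = [];
    preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

    foreach ($matches as $set) {
        foreach ($set as $backref_nr => [$matched, $matched_pos]) {
            // Skip named-group duplicates and groups that did not take part
            // in the match (those are reported with offset -1).
            if (!is_int($backref_nr) || $matched_pos < 0) {
                continue;
            }
            $len = strlen($matched);
            $tags[$matched_pos]['open'][$backref_nr] = '<span class="match' . $backref_nr . '">';
            $tags[$matched_pos + $len]['close'][$backref_nr] = '</span>';
        }
    }

    return $tags;
}

print_r(build_tags('/(abc)/', 'abc'));
```

Note that two matches opening or closing the same group at the same offset (only possible with empty matches) would overwrite each other in this layout.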

Then I construct the string like this, instead of inserting and sorting like @IMSoP does:

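$finalStr = "";
$length = strlen($text);
// Run one step past the last character so tags that close at the end are emitted
for ($i = 0; $i <= $length; $i++) {
    if (isset($tags[$i])) {
        foreach ($tags[$i] as $tag) {
            foreach ($tag as $span) {
                $finalStr .= $span;
            }
        }
    }
    // There is no character at offset $length, only possible closing tags
    if ($i < $length) {
        $finalStr .= $text[$i];
    }
}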

Where $text is the text used in preg_match_all()

I think my solution is slightly faster than @IMSoP's, since his sorts the positions and then the groups at each position, but I am not sure.

My main worry right now is performance. But it might just not be possible to make it work any faster than this?

I have been trying to get a recursive preg_replace_callback() thing going, but I've not been able to make it work so far. preg_replace_callback() seems to be very, very fast. Much faster than what I am currently doing anyway.
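One possible variation on the character-by-character loop, offered only as a sketch: visit just the offsets that carry tags and copy the text in between with substr(). The function name render_by_segments is hypothetical, the $tags layout is the one described in this answer, and whether it is actually faster has not been measured here.

```php
<?php
// Hypothetical variation: iterate over tag positions instead of characters.
function render_by_segments(string $text, array $tags): string
{
    ksort($tags);               // ascending insertion positions
    $out = '';
    $cursor = 0;

    foreach ($tags as $pos => $group) {
        // Copy the untouched text between the previous position and this one
        $out .= substr($text, $cursor, $pos - $cursor);
        $cursor = $pos;

        $close = $group['close'] ?? [];
        $open  = $group['open'] ?? [];
        krsort($close);         // inner groups close first
        ksort($open);           // outer groups open first
        $out .= implode('', $close) . implode('', $open);
    }

    // Append whatever follows the last tag position
    return $out . substr($text, $cursor);
}

$tags = [
    0 => ['open'  => [0 => '<span class="match0">', 1 => '<span class="match1">']],
    3 => ['close' => [0 => '</span>', 1 => '</span>']],
];
// Prints: <span class="match0"><span class="match1">abc</span></span>
echo render_by_segments('abc', $tags), "\n";
```

The work then scales with the number of tag positions rather than the length of the subject, which may help when matches are sparse in a long string.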

Upvotes: 0

IMSoP

Reputation: 97898

This seems to behave right for all the examples I've thrown at it so far. Note that I've broken the abstract highlighting part from the HTML-mangling part for reusability in other situations:

<?php

/**
 * Runs a regex against a string and returns a version of that string with matches highlighted.
 * The outermost match is marked with [0]...[/0], the first sub-group with [1]...[/1], etc.
 *
 * @param string $regex Regular expression ready to be passed to preg_match_all
 * @param string $input
 * @return string
 */
function highlight_regex_matches($regex, $input)
{
    $matches = array();
    preg_match_all($regex, $input, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

    // Arrange matches into groups based on their starting and ending offsets
    $matches_by_position = array();
    foreach ( $matches as $sub_matches )
    {
        foreach ( $sub_matches as $match_group => $match_data )
        {
            // Groups that took no part in the match are reported with offset -1
            if ( $match_data[1] < 0 )
            {
                continue;
            }

            $start_position = $match_data[1];
            $end_position = $start_position + strlen($match_data[0]);

            $matches_by_position[$start_position]['START'][] = $match_group;

            $matches_by_position[$end_position]['END'][] = $match_group;
        }
    }

    // Now proceed through that array, annotating the original string
    // Note that we have to pass through BACKWARDS, or we break the offset information
    $output = $input;
    krsort($matches_by_position);
    foreach ( $matches_by_position as $position => $matches )
    {
        $insertion = '';

        // First, assemble any ENDING groups, nested highest-group first
        // (!empty() avoids a warning when a position has no ENDING groups)
        if ( !empty($matches['END']) )
        {
            krsort($matches['END']);
            foreach ( $matches['END'] as $ending_group )
            {
                $insertion .= "[/$ending_group]";
            }
        }

        // Then, any STARTING groups, nested lowest-group first
        if ( !empty($matches['START']) )
        {
            ksort($matches['START']);
            foreach ( $matches['START'] as $starting_group )
            {
                $insertion .= "[$starting_group]";
            }
        }

        // Insert into output
        $output = substr_replace($output, $insertion, $position, 0);
    }

    return $output;
}

/**
 * Given a regex and a string containing unescaped HTML, return a blob of HTML
 * with the original string escaped, and matches highlighted using <span> tags
 *
 * @param string $regex Regular expression ready to be passed to preg_match_all
 * @param string $raw_html
 * @return string HTML ready to display :)
 */
function highlight_regex_as_html($regex, $raw_html)
{
    // Add the (deliberately non-HTML) highlight tokens
    $highlighted = highlight_regex_matches($regex, $raw_html);

    // Escape the HTML from the input
    $highlighted = htmlspecialchars($highlighted);

    // Substitute the match tokens with desired HTML
    $highlighted = preg_replace('#\[([0-9]+)\]#', '<span class="match\\1">', $highlighted);
    $highlighted = preg_replace('#\[/([0-9]+)\]#', '</span>', $highlighted);

    return $highlighted;
}

NOTE: As hakre has pointed out to me on chat, if a sub-group in the regex can occur multiple times within one overall match (e.g. '/a(b|c)+/'), preg_match_all will only tell you about the last of those matches. So highlight_regex_matches('/a(b|c)+/', 'abc') returns '[0]ab[1]c[/1][/0]', not '[0]a[1]b[/1][1]c[/1][/0]' as you might expect or want. All matching groups outside that will still work correctly though, so highlight_regex_matches('/a((b|c)+)/', 'abc') gives '[0]a[1]b[2]c[/2][/1][/0]', which is still a pretty good indication of how the regex matched.
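The limitation can be checked directly against preg_match_all(), independently of the highlighting functions above. A small sketch:

```php
<?php
// A repeated group keeps only its final iteration in the match data.
preg_match_all('/a(b|c)+/', 'abc', $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

// $matches[0][0] is the overall match: text 'abc' at offset 0.
var_export($matches[0][0]);
echo "\n";

// $matches[0][1] is group 1: only the last iteration, text 'c' at offset 2.
// The earlier iteration that matched 'b' at offset 1 is not reported.
var_export($matches[0][1]);
echo "\n";
```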

Upvotes: 3

hakre

Reputation: 198109

Reading your comment under the first answer, I'm pretty sure you did not really formulate the question as you intended to. However, going by what you concretely asked for, that is:

$pattern = "/(.)/";
$subject = "abc";

$callback = function($matches) {
    if ($matches[0] !== $matches[1]) {
        throw new InvalidArgumentException(
            sprintf('you do not match the requirements, go away: %s'
                    , print_r($matches, 1))
        );
    }
    return sprintf('<span class="match0"><span class="match1">%s</span></span>'
                   , htmlspecialchars($matches[1]));
};
$result = preg_replace_callback($pattern, $callback, $subject);

Before you start to complain, first take a look at where your description of the problem falls short. I have the feeling you actually want to parse the result for matches. However, you want to do sub-matches. That does not work unless you also parse the regular expression to find out which groups are used. That is not the case so far, neither in your question nor in this answer.

So please take this example only for one subgroup, which as a requirement must also be the whole pattern. Apart from that, this is fully dynamic.

Upvotes: 0

Ogelami

Reputation: 365

A quick mashup, why use regex?

$content = "abc";
$endcontent = "";

for ($i = 0; $i < strlen($content); $i++)
{
    $endcontent .= "<span class=\"match0\"><span class=\"match1\">" . $content[$i] . "</span></span>";
}

echo $endcontent;

Upvotes: -1
