Search in Textblock for shortest passage of given keyword

Question

i have a task and i am not sure how i should solve the problem. I have an idead but i do not know if it is the best way to solve it.

Here is the task: Given is a block of text and some keywords to find. We need to find a passage where all words can be found and where least words are used. Only the letters from A-Z and a-z need to be taken into account.

Here is an example:
Textblock:
Ein toller Beispieltext ist der Blindtext. Er hat ein paar Wörter. Dies ist ein Beispieltext, der ein paar Wörter hat und auch noch ein paar mehr, um die Zeile etwas länger zu machen. Darüber hinaus ist er nur dafür da, um genügend Testtext zusammenzubekommen. Dem Text selbst macht das nicht so viel aus. Früher einmal mehr, als er noch nicht so selbstbewusst war. Heute kennt er seine Rolle als Blindtext und fügt sich selbstbewusst ein. Er ist ja irgendwie wichtig. Manchmal jedoch, ganz manchmal, weint er in der Nacht, weil er niemals bis zum Ende gelesen wird. Doch das hat ja jetzt zum Glück ein Ende.

And here the words that need to be found: ein Beispieltext der paar Wörter

The result would be Beispieltext der ein paar Wrter

Following passage would also be a passage where all the words can be found but it has more words inside of the passage and therefore is not the solution: Ein toller Beispieltext ist Blindtext. Er hat ein paar Wörter.

My idea is to cut all unnecessary letters and then split the textblock on spaces to have an array of all words. so i can get the position of the words and calculate how much words are inbetween the first occurence of one of the searched words and the first occurence of all other searched words. this way i would need to go through the whole array and compare all possible lengths of passages and just take the shortest one.

do you think this is the best approach or can you point me to a better idea how to solve this problem?

trincot · Accepted Answer

The algorithm you describe is could be OK but it is less clearly specified when it comes to "... this way I would need to go through the whole array".

Once you have done the clean up and splitting into words, It will be easier to create a map for the key words, so you can know quickly if a word from the text matches (with isset()). Then you could just reduce the text array to an array of matching words (using array_filter()), still keeping the index of where they appear in the original array of words.

The algorithm would then walk through that reduced array and keep track of a window (range) of these words. At the right side that window enlarges as long as not all necessary words are inside of it, and it shrinks at the left side when the left side word already occurs elsewhere in the window, or just after you have found a candidate solution. That way your window will travel through the whole (reduced) text array. You'll keep track only of the window that represents the shortest phrase. So at the end you have the optimal solution and just need to translate the window boundaries back to a phrase taken from the original text array.

Case insensitive matching can be achieved by storing things in lower case (with strtolower), and by using the original cased string (in array format) for generating the output.

Here is a function that implements the above described algorithm:

function findFragment($text, $words) {
    // Remove non-A-Z letters
    $text = preg_replace("/[^a-z ]/i", "", $text);
    $words = preg_replace("/[^a-z ]/i", "", $words);
    // Create a map keyed by the words to find, with as value 
    // the number of occurrences in current sub-phrase
    $words_map = array_fill_keys(str_word_count(strtolower($words), 2), 0);
    // Put all words of text in an array
    $text_arr = str_word_count($text, 1);
    $text_low_arr = str_word_count(strtolower($text), 1);
    // Filter only matching words from the text, keeping their original indexes.
    $matches = array_filter($text_low_arr, function ($word) use ($words_map) {
        return isset($words_map[$word]);
    });
    // How many distinct words need to be matched to have a candidate phrase
    $matches_left = count($words_map);
    // Keep track of how long the shortest phrase is
    $min_words = count($text_arr) + 1; // start "infinite"
    // Loop over all matching words as the last word of a possible phrase
    foreach($matches as $i => $match) {
        $phrase[$i] = $match; // Add to the phrase
        $words_map[$match]++; // Increase count for this particular word
        if ($words_map[$match] > 1) continue; // Nothing new was added
        // Additional word found
        $matches_left--;
        if ($matches_left) continue; // Still need more words
        // Phrase has all words
        // Remove words from left which occur elsewhere in the phrase
        while ($words_map[reset($phrase)] > 1) {
            $words_map[reset($phrase)]--;
            unset($phrase[key($phrase)]);
        }
        // How many words are in this phrase?
        $num_words = $i - key($phrase) +1;
        if ($num_words < $min_words) {
            // It is shorter than we had so far
            $min_words = $num_words;
            $best_start = key($phrase);
        }
        // Remove first word from phrase before finding new candidate phrases
        $words_map[reset($phrase)]--;
        unset($phrase[key($phrase)]);
        $matches_left++;
    }
    // return best result
    return implode(" ", array_slice($text_arr, $best_start, $min_words));
}

You would call it like this:

echo findFragment($text, $words);

For the sample data you have given in the question, it returns the desired answer:

Beispieltext der ein paar Wrter

See it run on eval.in.

Search in Textblock for shortest passage of given keyword

Answers (2)

Foreward

Description

Example

PHP Word Count

Related Questions