Reputation: 14250

Find largest cluster of a particular word in a block of text

I have a block of text (of arbitrary length) with a particular word highlighted in yellow whenever it appears. I want to show only a 400 word chunk of the text but I want to show the chunk with the most highlighted words.

Does anyone know a good algorithm for this?

I have the character position of each highlighted word, so the algorithm needs to find the densest cluster of unevenly spaced integers?

Upvotes: 3

Answers (4)

bbfire

Reputation: 3598

You have the indicies of the highlighted words.. I think the below is a good, fast approach as it doesn't require "finding" every word (to do the circular loop). To do this it uses a "chunk size" derived from a number of characters rather than words. You could then "round up" or "round down" to the nearest word ending and there you have your chunk.

The method to get how many highlighted indicies are within a "chunk size" ahead in your sample could be better done I think.

Pseudo

string GetHighestDensityChunk(){

// {chunk size} = 400 * average word length
// {possible start positions} = 0, highlighted indicies, and (sample - {chunk size})

int position
int bestPositionSoFar = 0
int maxHighLightedCountSoFar = 0


for each position in {possible start position}
{
    highlightedCount = GetNumberOfHighlightedWithinChunkSize(position)

    if(highlightedCount > maxHighLightedCountSoFar) 
    {
        maxHighLightedCountSoFar = highlightedCount
        bestPositionSoFar = position
    }
}

// "round up" to nearest word end
// gives index of next space after end of chunk starting from current best position
{revised chunk size} = sample.indexOf(' ', startingAt = bestPositionSoFar + {chunk size}) - bestPositionSoFar

return sample.substring(bestPositionSoFar, {revised chunk size})
}   


 int GetNumberOfHighlightedWithinChunkSize(position)
{
    numberOfHighlightedInRange = 0

    // starts from current position and scans forward counting highlighted indicies that are in range
    for(int i= {possible start position}.indexOf(position); i<= {possible start position}.length; i++){
        if({possible start position}[i] < position + {chunk size}){
            numberOfHighlightedInRange++;
        } else {
            break;
        }
    }
    return numberOfHighlightedInRange;
}

Upvotes: 1

sigint

Reputation: 1882

This isn't exactly what you asked for, but I've used something like this in the past, while searching for the words (charPos refers to the starting character position of the word). Note: the '/' operator does integer division, i.e. 4200/2000 = 2.

if hasKey(charPositionHashtable[charPos/2000]):
    charPositionHashtable[charPos/2000]) += 1
else:
    charPositionHashtable[charPos/2000]) = 1

After the search completes, charPositionHashtable has a bunch of key/value pairs containing an "index" to 2000-character chunks, and the count of found words in them. Take the max, and use the chunk corresponding to that index. This has the advantage of being better than O(n), I think (but I didn't do a lot of analysis on this).

Upvotes: 1

Victor

Reputation: 5807

I'm not sure how you know they're highlighted but here's a simple O(n) aproach that I'd try.

scan the words into a circular queue (max capacity of 400) and if they're highlighted then increment the counter, once you reach the queues capacity, dequeue words as is necesary to enqueue the next one. when you dequeue a highlighted word decrement the counter. keep track of the max that your counter reaches at all times and where this chunk of 400 words starts at the max.

not too elegant but fairly simple.

Upvotes: 6

Draemon

Reputation: 34711

you could do a word-by-word moving average (over the last 400 words), while keeping track of the maximum seen so far. Once you're done, your maximum tell you which 400 words to use.

Upvotes: 2

Find largest cluster of a particular word in a block of text

Answers (4)

Related Questions