Reputation: 14250
I have a block of text (of arbitrary length) with a particular word highlighted in yellow whenever it appears. I want to show only a 400 word chunk of the text but I want to show the chunk with the most highlighted words.
Does anyone know a good algorithm for this?
I have the character position of each highlighted word, so the algorithm needs to find the densest cluster of unevenly spaced integers?
Upvotes: 3
Views: 387
Reputation: 3598
You have the indicies of the highlighted words.. I think the below is a good, fast approach as it doesn't require "finding" every word (to do the circular loop). To do this it uses a "chunk size" derived from a number of characters rather than words. You could then "round up" or "round down" to the nearest word ending and there you have your chunk.
The method to get how many highlighted indicies are within a "chunk size" ahead in your sample could be better done I think.
Pseudo
string GetHighestDensityChunk(){
// {chunk size} = 400 * average word length
// {possible start positions} = 0, highlighted indicies, and (sample - {chunk size})
int position
int bestPositionSoFar = 0
int maxHighLightedCountSoFar = 0
for each position in {possible start position}
{
highlightedCount = GetNumberOfHighlightedWithinChunkSize(position)
if(highlightedCount > maxHighLightedCountSoFar)
{
maxHighLightedCountSoFar = highlightedCount
bestPositionSoFar = position
}
}
// "round up" to nearest word end
// gives index of next space after end of chunk starting from current best position
{revised chunk size} = sample.indexOf(' ', startingAt = bestPositionSoFar + {chunk size}) - bestPositionSoFar
return sample.substring(bestPositionSoFar, {revised chunk size})
}
int GetNumberOfHighlightedWithinChunkSize(position)
{
numberOfHighlightedInRange = 0
// starts from current position and scans forward counting highlighted indicies that are in range
for(int i= {possible start position}.indexOf(position); i<= {possible start position}.length; i++){
if({possible start position}[i] < position + {chunk size}){
numberOfHighlightedInRange++;
} else {
break;
}
}
return numberOfHighlightedInRange;
}
Upvotes: 1
Reputation: 1882
This isn't exactly what you asked for, but I've used something like this in the past, while searching for the words (charPos refers to the starting character position of the word). Note: the '/' operator does integer division, i.e. 4200/2000 = 2.
if hasKey(charPositionHashtable[charPos/2000]):
charPositionHashtable[charPos/2000]) += 1
else:
charPositionHashtable[charPos/2000]) = 1
After the search completes, charPositionHashtable
has a bunch of key/value pairs containing an "index" to 2000-character chunks, and the count of found words in them. Take the max, and use the chunk corresponding to that index. This has the advantage of being better than O(n), I think (but I didn't do a lot of analysis on this).
Upvotes: 1
Reputation: 5807
I'm not sure how you know they're highlighted but here's a simple O(n) aproach that I'd try.
scan the words into a circular queue (max capacity of 400) and if they're highlighted then increment the counter, once you reach the queues capacity, dequeue words as is necesary to enqueue the next one. when you dequeue a highlighted word decrement the counter. keep track of the max that your counter reaches at all times and where this chunk of 400 words starts at the max.
not too elegant but fairly simple.
Upvotes: 6
Reputation: 34711
you could do a word-by-word moving average (over the last 400 words), while keeping track of the maximum seen so far. Once you're done, your maximum tell you which 400 words to use.
Upvotes: 2