Reputation: 11972
I have an array:
$myArray=array(
'hello my name is richard',
'hello my name is paul',
'hello my name is simon',
'hello it doesn\'t matter what my name is'
);
I need to find the sub string (min 2 words) that is repeated the most often, maybe in an array format, so my return array could look like this:
$return=array(
array('hello my', 3),
array('hello my name', 3),
array('hello my name is', 3),
array('my name', 4),
array('my name is', 4),
array('name is', 4),
);
So I can see from this array of arrays how often each string was repeated amongst all strings in the array.
Is the only way to do it like this?..
function repeatedSubStrings($array){
foreach($array as $string){
$phrases=//Split each string into maximum number of sub strings
foreach($phrases as $phrase){
//Then count the $phrases that are in the strings
}
}
}
I've tried a solution similar to the above but it was too slow, processing around 1000 rows per second, can anyone do it faster?
Upvotes: 6
Views: 2605
Reputation: 2357
A solution to this might be
function getHighestRecurrence($strs){
/*Storage for individual words*/
$words = Array();
/*Process multiple strings*/
if(is_array($strs))
foreach($strs as $str)
$words = array_merge($words, explode(" ", $str));
/*Prepare single string*/
else
$words = explode(" ",$strs);
/*Array for word counters*/
$index = Array();
/*Aggregate word counters*/
foreach($words as $word)
/*Increment count or create if it doesn't exist*/
(isset($index[$word]))? $index[$word]++ : $index[$word] = 1;
/*Sort array hy highest value and */
arsort($index);
/*Return the word*/
return key($index);
}
Upvotes: 5
Reputation: 6886
While this has a higher runtime, I think it's simpler from an implementation perspective:
$substrings = array();
foreach ($myArray as $str)
{
$subArr = explode(" ", $str);
for ($i=0;$i<count($subArr);$i++)
{
$substring = "";
for ($j=$i;$j<count($subArr);$j++)
{
if ($i==0 && ($j==count($subArr)-1))
break;
$substring = trim($substring . " " . $subArr[$j]);
if (str_word_count($substring, 0) > 1)
{
if (array_key_exists($substring, $substrings))
$substrings[$substring]++;
else
$substrings[$substring] = 1;
}
}
}
}
arsort($substrings);
print_r($substrings);
Upvotes: 2
Reputation: 31823
This should run in O(n) time
$twoWordPhrases = function($str) {
$words = preg_split('#\s+#', $str, -1, PREG_SPLIT_NO_EMPTY);
$phrases = array();
foreach (range(0, count($words) - 2) as $offset) {
$phrases[] = array_slice($words, $offset, 2);
}
return $phrases;
};
$frequencies = array();
foreach ($myArray as $str) {
$phrases = $twoWordPhrases($str);
foreach ($phrases as $phrase) {
$key = join('/', $phrase);
if (!isset($frequencies[$key])) {
$frequencies[$key] = 0;
}
$frequencies[$key]++;
}
}
print_r($frequencies);
Upvotes: 0
Reputation: 53349
I'm assuming by "substring" you really mean "substring split along word boundaries" since that's what your example shows.
In that case, assuming any maximum repeated substring will do (since there may be ties), you can always choose just a single word as a maximum repeated substring, if you think about it. For any phrase "A B", the phrases "A" and "B" individually must occur at least as often as "A B" because they both occur every time "A B" does and they may occur at other times. Therefore, a single word must be have a count that at least ties with any substring that contains that word.
So you just need to split all phrases into a set of unique words, and then just count the words and return one of the words with the highest count. This will run way faster than actually counting every possible substring.
Upvotes: 1