Reputation: 1764
I have a function that strips HTML, places the words in an array, and then uses array_count_values. I'm trying to report the number of occurrences of each word. The output array is very messy. I tried to clean it up, and I'm getting nowhere. I want to remove telephone numbers, and for some reason phrases are pushed together. Also, the first array key seems to be empty, but checking it with isset() or empty() doesn't seem to unset it.
$body = $this->get_response($domain);
$body = preg_replace('/<body(.*?)>/i', '<body>', $body);
$body = preg_replace('#</body>#i', '</body>', $body);
$openTag = '<body>';
$start = strpos($body, $openTag);
$start += strlen($openTag);
$closeTag = '</body>';
$end = strpos($body, $closeTag);
// Return if cannot cut-out the body
if ($end <= $start || $start === false || $end === false) {
$this->setValue('');
return;
}
$body = substr($body, $start, $end - $start);
$body = preg_replace(array(
'@<script[^>]*?>.*?</script>@si', // Strip out javascript
'@<style[^>]*?>.*?</style>@siU', // Strip style tags properly
'@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA
'/style=([\"\']??)([^\">]*?)\\1/siU',// Strip inline style attribute
), '', $body);
$body = strip_tags($body);
$body = array_filter(explode(' ', $body), create_function('$str', 'return strlen($str) > 2;'));
$body = array_map('trim', $body);
$words = $body;
$i = 0;
$words = array_count_values($words);
foreach ($words as $word) {
    if (empty($word)) unset($words[$i]);
    $i++;
}
echo "<pre>";
print_r($words);
echo "</pre>";
This outputs:
Array
(
[] => 28
[333.444.5555] => 1
[facebook] => 2
[twitter] => 2
[linkedin] => 2
[youtube
googleplus] => 1
[About
History
Our] => 1
[Mission
Who] => 1
[This
That
Other] => 1
[Us
English
FA
Football] => 1
[Media
Pay] => 2
[Per] => 4
[Think
Fast] => 2
[Marketing
Design] => 1
[Consulting
Case] => 2
Upvotes: 0
Views: 121
Reputation: 111239
I'm afraid explode(' ', $body) is not enough, because the space is not the only whitespace character: tabs and newlines separate words too. Try preg_split instead.
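// Split on any run of whitespace, then keep only tokens longer than 2 characters.
// A closure is used because create_function() was removed in PHP 8.0.
$body = array_filter(preg_split('/\s+/', $body),
    function ($str) { return strlen($str) > 2; });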
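To tie this to the other two problems in the question (the telephone numbers and the empty key), here is a minimal sketch of the whole counting step. The helper name count_words and the phone-number pattern are assumptions for illustration, not part of the original code; the pattern simply drops any token made up only of digits and common separators, so plain numbers such as years are dropped as well.

```php
<?php
// Hypothetical helper (not from the original post): counts the words in text
// that has already been passed through strip_tags().
function count_words(string $text): array
{
    // Split on any run of whitespace (spaces, tabs, newlines).
    // PREG_SPLIT_NO_EMPTY drops empty pieces, so no '' key can appear later.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $tokens = array_filter($tokens, function ($token) {
        // Assumption: a "telephone number" is a token consisting only of
        // digits and separators, e.g. 333.444.5555 or (333)444-5555.
        if (preg_match('/^[\d().+-]+$/', $token)) {
            return false;
        }
        // Same length rule as in the question: keep tokens longer than 2 characters.
        return strlen($token) > 2;
    });

    // Keys of the result are the words themselves, values are their counts.
    return array_count_values($tokens);
}

$counts = count_words("facebook twitter\nyoutube\ngoogleplus 333.444.5555 facebook\tPer  ");
print_r($counts);
```

As for the cleanup loop in the question: array_count_values returns an array keyed by the words themselves, so the foreach iterates over the counts and unset($words[$i]) targets numeric keys that do not exist. The empty entry has the key '' and could be removed with unset($words['']), but splitting on whitespace as above should keep it from appearing in the first place.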
Upvotes: 1