Trevor

Reputation: 2457

PHP extra whitespace not being deleted

I'm counting the words in an article and removing common words such as "and" or "the". I'm removing them with preg_replace.

Once that is done, I do a quick cleanup of the extra whitespace using:

$search_body = preg_replace('/\s+/',' ',$search_body);

However, I've got some very stubborn whitespace that will not go away. I've tried:

if($word == "" OR $word == " "){
  //chop its head off
}

But the if statement does not treat $word as whitespace. I've also tried printing it to the screen to inspect the raw value, and it still shows up blank.

Here is the full list of patterns that I'm using:

$pattern = array(   
        '/\&quot\;/',
        '/[0-9]/',
        '/\,/',
        '/\./',
        '/\!/',
        '/\@/',
        '/\#/',
        '/\$/',
        '/\%/',
        '/\^/',
        '/\&/',
        '/\*/',
        '/\(/',
        '/\)/',
        '/\_/',
        '/\"/',
        '/\'/',
        '/\:/',
        '/\;/',
        '/\?/',
        '/\`/',
        '/\~/',
        '/\[/',
        '/\]/',
        '/\{/',
        '/\}/',
        '/\|/',
        '/\+/',
        '/\=/',
        '/\-/',
        '/–/',
        '/°/',
        '/\bthe\b/',
        '/\band\b/',
        '/\bthat\b/',
        '/\bhave\b/',
        '/\bfor\b/',
        '/\bnot\b/',
        '/\bwith\b/',
        '/\byou\b/',
        '/\bthis\b/',
        '/\bbut\b/',
        '/\bhis\b/',
        '/\bfrom\b/',
        '/\bthey\b/',
        '/\bsay\b/',
        '/\bher\b/',
        '/\bshe\b/',
        '/\bwill\b/',
        '/\bone\b/',
        '/\ball\b/',
        '/\bwould\b/',
        '/\bthere\b/',
        '/\btheir\b/',
        '/\bwhat\b/',
        '/\bout\b/',
        '/\babout\b/',
        '/\bwho\b/',
        '/\bget\b/',
        '/\bwhich\b/',
        '/\bwhen\b/',
        '/\bmake\b/',
        '/\bcan\b/',
        '/\blike\b/',
        '/\btime\b/',
        '/\bjust\b/',
        '/\bhim\b/',
        '/\bknow\b/',
        '/\btake\b/',
        '/\bpeople\b/',
        '/\binto\b/',
        '/\byear\b/',
        '/\byour\b/',
        '/\bgood\b/',
        '/\bsome\b/',
        '/\bcould\b/',
        '/\bthem\b/',
        '/\bsee\b/',
        '/\bother\b/',
        '/\bthan\b/',
        '/\bthen\b/',
        '/\bnow\b/',
        '/\blook\b/',
        '/\bonly\b/',
        '/\bcome\b/',
        '/\bits\b/', //it's?
        '/\bover\b/',
        '/\bthink\b/',
        '/\balso\b/',
        '/\bback\b/',
        '/\bafter\b/',
        '/\buse\b/',
        '/\btwo\b/',
        '/\bhow\b/',
        '/\bour\b/',
        '/\bwork\b/',
        '/\bfirst\b/',
        '/\bwell\b/',
        '/\bway\b/',
        '/\beven\b/',
        '/\bnew\b/',
        '/\bwant\b/',
        '/\bbecause\b/',
        '/\bany\b/',
        '/\bthese\b/',
        '/\bgive\b/',
        '/\bday\b/',
        '/\bmost\b/',
        '/\bare\b/',
        '/\bwas\b/',
        '/\<\w+\>/', '/\<\/\w+\>/',
        '/\b\w{1}\b/', //1 letter word
        '/\b\w{2}\b/', //2 letter word
        '/\//',
        '/\</',       
        '/\>/'
        );

$search_body    = strip_tags($body);
$search_body    = strtolower($search_body);
$search_body    = preg_replace($pattern, ' ', $search_body);
$search_body    = preg_replace('/\s+/',' ',$search_body);
$search_body    = explode(" ", $search_body);

When exploded, blank values show up left and right.

The example text I am using is too long to post here, but I copied and pasted this article as a test and it showed 32 counts of whitespace, not including the whitespace in front of or behind other words, even after using trim().

Here's a js.fiddle of the raw data that is being handled by PHP.

htmlentities and htmlspecialchars also show nothing.

Here's the code that counts all the values and puts them into one array:

$inhere     = array();
$body_hold  = array();
foreach($search_body as $value){
  $value = trim($value);
  if(in_array($value, $inhere) && $value != ""){
    $key = array_search($value, $inhere);
    $body_hold[$key]['count'] = $body_hold[$key]['count']+1;
  }elseif($value != ""){
    $inhere[] = $value;
    $body_hold[] = array(
      'count'  => 1,
      'word'   => $value
    );
  }
}
rsort($body_hold);
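
For comparison, the same tally can be sketched with PHP's built-in array_count_values. This is only an illustration: the sample word list is hypothetical, and the result is keyed by word rather than stored as count/word pairs as in the loop above.

```php
<?php
// Hypothetical word list, standing in for the exploded $search_body.
$search_body = ['solar', '', 'panel', 'solar', ' ', 'energy', 'solar', 'panel'];

// Trim each entry, then drop the ones that end up empty.
// Note: trim() only strips ASCII whitespace by default, not U+00A0.
$words = array_filter(
    array_map('trim', $search_body),
    function ($word) {
        return $word !== '';
    }
);

// Count occurrences of each word, then sort by count, highest first,
// keeping the word => count association.
$counts = array_count_values($words);
arsort($counts);

foreach ($counts as $word => $count) {
    echo "Count: $count Word: $word<br>";
}
```

With this shape, the separate $inhere lookup array is not needed, because the word itself is the array key.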

A basic foreach to see the values:

foreach($body_hold as $value){
  $count  = $value['count'];
  $word   = trim($value['word']);
  echo "Count: ".$count;
  echo " Word: ".$word;
  echo '<br>';
}

Here's a PHP example of what it's returning.

Upvotes: 1

Views: 104

Answers (2)

Tengiz

Reputation: 1920

Character 160 (the non-breaking space) looks like a space but is not one. Replacing every occurrence with a regular space (character 32) and then collapsing the repeated spaces will fix your problem.

$search_body = str_replace(chr(160), chr(32), $search_body);
$search_body = trim(preg_replace('/\s+/', ' ', $search_body));
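
One caveat, assuming the text is UTF-8 encoded: there the non-breaking space is the two-byte sequence 0xC2 0xA0 rather than the single byte 160, so replacing chr(160) alone may leave a stray 0xC2 byte behind. A minimal sketch under that assumption (the sample string is hypothetical):

```php
<?php
// Hypothetical UTF-8 input containing non-breaking spaces (U+00A0).
$search_body = "\u{00A0}solar\u{00A0}\u{00A0}panel \u{00A0} energy ";

// Collapse every run of ordinary whitespace and/or U+00A0 into one space.
// The /u modifier makes PCRE read the pattern and the subject as UTF-8.
$search_body = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $search_body));

echo $search_body, PHP_EOL;
```

Note that with the /u modifier preg_replace returns null if the input is not valid UTF-8, so it may be worth checking the return value before using it.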

Upvotes: 0

romulusnr

Reputation: 121

Are you sure you put the exact same data you're processing in the js.fiddle? Or did you get it from a subsequent post-processed step?
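
One way to check what the script is really receiving is to dump the bytes of a value that looks blank. A rough sketch; the helper name describe_bytes is made up for illustration:

```php
<?php
// Hypothetical helper: report the length and the hex bytes of a string so
// that invisible characters become visible.
function describe_bytes(string $word): string
{
    return strlen($word) . ' byte(s): ' . bin2hex($word);
}

// A UTF-8 non-breaking space is two bytes (c2 a0) and prints as blank.
echo describe_bytes("\u{00A0}"), PHP_EOL;

// A regular space is the single byte 20.
echo describe_bytes(' '), PHP_EOL;
```

If the blank entries in your array show bytes other than 20, they are not plain spaces, which would explain why the comparison against " " fails.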

It's obviously a Wikipedia article. I went to that article on Wikipedia, opened it in Edit mode, and saw that there are &nbsp;s in the raw wikitext. However, those &nbsp;s don't appear in your js.fiddle data.

TL;DR: Check for &nbsp; in your processing (and convert to spaces, etc.).
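
As a sketch of that conversion, assuming UTF-8 input (the sample $body is hypothetical): decode the entities first, then fold the resulting non-breaking spaces into regular ones.

```php
<?php
// Hypothetical input containing the HTML entity for a non-breaking space.
$body = 'solar&nbsp;panel&nbsp;&nbsp;energy';

// Decode entities first: &nbsp; becomes the character U+00A0 (as UTF-8).
$decoded = html_entity_decode($body, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Then fold U+00A0 and ordinary whitespace into single regular spaces.
$search_body = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $decoded));

echo $search_body, PHP_EOL;
```

Doing this before the punctuation patterns run matters, because otherwise the '&' and ';' patterns would strip the entity's delimiters and leave "nbsp" behind as a word.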

Upvotes: 1
