Reputation: 3346
I am scraping the DOM of a static site with PHP and pulling out specific bits of data so I can store them in a database.
For this example I am storing the inner HTML of an element in $domString. I can see the string is 'Description', but when I compare $domString to 'Description' in the code, there isn't a match.
if($domString == 'Description') {
// This is not happening, even though I know
// $domString contains 'Description' :(
}
I have stripped whitespace and so on. When I var_dump() them both, I get this:
string(45) "Description"
string(11) "Description"
Running them both through bin2hex(), as Álvaro G. Vicario suggests, returns the following two values respectively:
3c74642076616c69676e3d22746f702220636f6c7370616e3d2232223e4465736372697074696f6e3c2f74643e
4465736372697074696f6e
I need a way to strip whatever is padding out that first string.
Upvotes: 1
Views: 3641
Reputation: 29
One solution is to use a regex like this:
function clean($string) {
$string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
return preg_replace('/[^A-Za-z0-9\-\;\,\?\*\%\@\$\!\(\)\#\=\&]/', '', $string); // Removes special chars
}
Adapt the character class to your needs: add any special character you want to keep, escaped like \# or \=.
Upvotes: 0
Reputation: 146630
The number in parentheses is the total byte count. Obviously, a 45-byte string cannot be identical to an 11-byte one.
You can use bin2hex() to inspect the exact bytes. I also suggest you don't look at the output rendered as HTML; view the page source instead (in most browsers you can hit Ctrl+U).
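A minimal sketch of that inspection, assuming the scraped value is the 45-byte string whose hex dump appears in the question (the literal below is decoded from that dump and merely stands in for the real $domString):

```php
<?php
// Assumption: this literal stands in for the scraped value from the question.
$domString = '<td valign="top" colspan="2">Description</td>';

// strlen() counts bytes, which is the number var_dump() reports in parentheses.
echo strlen($domString), PHP_EOL;      // 45
echo strlen('Description'), PHP_EOL;   // 11

// bin2hex() exposes every byte, including markup a browser would not display.
echo bin2hex($domString), PHP_EOL;

// Escaping the value makes a browser show the tags instead of rendering them.
echo htmlspecialchars($domString), PHP_EOL;
```

The last line is an alternative to Ctrl+U when the debugging output has to be read in the rendered page.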
Edit: asking why two given strings render the same words after being processed by a web browser is better answered by actually looking at the real raw data (as opposed to just looking at the output produced by the browser).
Edit #2:
var_dump( hex2bin('3c74642077696474683d223832222076616c69676e3d22746f70223e547970653c2f74643e') );
... prints this:
string(37) "<td width="82" valign="top">Type</td>"
Do you want to strip HTML tags or something? Did you see the raw HTML?
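If stripping the tags is indeed the goal, one possible approach is PHP's built-in strip_tags(). The sketch below decodes the first hex string from the question and compares the result; the variable names are illustrative, and it assumes the cell contains no HTML entities that would also need decoding.

```php
<?php
// Hex dump of the 45-byte string posted in the question.
$hex = '3c74642076616c69676e3d22746f702220636f6c7370616e3d2232223e4465736372697074696f6e3c2f74643e';
$domString = hex2bin($hex);

// Remove the surrounding <td ...> and </td> markup, then trim stray whitespace.
$text = trim(strip_tags($domString));

var_dump($text);                    // string(11) "Description"
var_dump($text === 'Description');  // bool(true)
```

Reading the text content of the node from the DOM parser, rather than its serialized HTML, would likely avoid the problem at the source.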
Upvotes: 4
Reputation: 1687
You should ask why this happens:
string(45) "Description"
string(11) "Description"
The second one is 11 chars, the first one is 45! Why? Because there are some hidden (not displayed) characters or symbols. That's why these strings are not equal.
Try this one: "Remove control characters from PHP string".
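A rough sketch of what removing control characters can look like, assuming the unwanted bytes are ASCII control characters (0x00 to 0x1F and 0x7F). The sample input is made up for illustration. Note that this only helps when the hidden bytes really are control characters; HTML tags such as those in the question's hex dump are left untouched.

```php
<?php
// Hypothetical input: 'Description' polluted with a NUL byte, a tab and a newline.
$dirty = "Descri\x00ption\t\n";

// Drop every ASCII control character (0x00-0x1F and 0x7F).
$clean = preg_replace('/[\x00-\x1F\x7F]/', '', $dirty);

var_dump($clean === 'Description'); // bool(true)

// Markup is not a control character, so it survives unchanged.
$html = '<td valign="top" colspan="2">Description</td>';
var_dump(preg_replace('/[\x00-\x1F\x7F]/', '', $html) === $html); // bool(true)
```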
Upvotes: 0