mrpatg

Reputation: 10117

Removing invisible characters from UTF-8 XML data

I am consuming an XML feed that contains a great deal of whitespace. When I echo out the raw feed, the tabular data appears to be aligned into columns using nothing but whitespace padding.

I have tried many regex patterns to strip it or to allow only visible characters, as well as trim, chop, and UTF-8 encoding/decoding, and nothing touches it. It feels like it is laughing in my face when I echo out a value and see this:

string(17) "72"

I opened the data in Notepad++ with Show All Characters turned on, and it simply shows the padding as spaces. I am at a loss as to where to go with this.

I did receive the following error:

simplexml_load_string(): Entity: line 265: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x43 0x20 0x74

Upvotes: 0

Views: 1583

Answers (3)

mrpatg

Reputation: 10117

Solution

My very hacky workaround that works:

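// Fetch the raw feed as a byte string
$raw = file_get_contents('http://stupidwebservice.com/xmldata.asmx/Feed');
// utf8_encode() treats the input as ISO-8859-1 (deprecated as of PHP 8.2);
// urlencode() then turns every space into a '+'
$raw = urlencode(utf8_encode($raw));
// Drop each pair of consecutive '+' characters (i.e. runs of spaces)
$raw = str_replace('++','',$raw);
$raw = urldecode($raw);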

Running urlencode() after the UTF-8 encoding turned each space into a +. I simply removed every occurrence of ++ and then decoded the string again. Works great.
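
For anyone reusing this, here is a minimal sketch of the same round trip on a single padded value. The sample string is hypothetical and plain ASCII, so the utf8_encode() step is left out. It suggests one caveat: str_replace() removes the plus signs in pairs, so a run with an odd number of spaces leaves one space behind, and a final trim() is probably still worth having.

```php
<?php
// Hypothetical sample: "72" followed by three spaces of padding.
$value = "72   ";

$encoded  = urlencode($value);                // spaces become '+'
$stripped = str_replace('++', '', $encoded);  // removes '+' in pairs only
$decoded  = urldecode($stripped);             // a leftover '+' becomes a space again

// Trim whatever padding survived the pair-wise removal.
$clean = trim($decoded);
```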

Upvotes: 0

Mike Weir

Reputation: 3189

Try running the data through utf8_encode(). It might seem like a hack, but it looks like the originating data isn't properly set up.

My theory is that you're grabbing it with the wrong encoding, and the proper solution would be to load it differently.
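
A rough sketch of what "loading it differently" could look like, assuming the feed is actually ISO-8859-1 (byte 0xB0 from the error message is the degree sign in that encoding) and that the mbstring extension is available. The sample XML and the element name temp are made up for illustration.

```php
<?php
// Hypothetical sample: ISO-8859-1 bytes with padding and a degree sign (0xB0).
$raw = "<row><temp>72              \xB0C</temp></row>";

// Convert only when the bytes are not already valid UTF-8.
// Assumption: the real source encoding is ISO-8859-1.
if (!mb_check_encoding($raw, 'UTF-8')) {
    $raw = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

$xml = simplexml_load_string($raw);

// Collapse runs of whitespace to one space; the /u flag makes the
// pattern treat the subject as UTF-8 instead of raw bytes.
$temp = trim(preg_replace('/\s+/u', ' ', (string) $xml->temp));
```

If the feed declares or uses a different encoding, the third argument to mb_convert_encoding() would need to change accordingly.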

Upvotes: 1

Joe Stanton

Reputation: 61

I just found this regex (untested):

$xml_data = preg_replace("/>\s+</", "><", $xml_data);

If you are using the xml parser, I think you can use the 'XML_OPTION_SKIP_WHITE' option referenced here: http://php.net/manual/en/function.xml-parser-set-option.php
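
Here is a small sketch of both ideas side by side, run on a made-up snippet (the rows/row element names are hypothetical). It is only meant to show how the calls fit together, not a tested solution for the feed in the question.

```php
<?php
// Hypothetical sample with indentation between the tags.
$xml_data = "<rows>\n   <row>72</row>\n   <row>68</row>\n</rows>";

// Option 1: the regex removes whitespace that sits between two tags.
$compact = preg_replace("/>\s+</", "><", $xml_data);

// Option 2: let the expat-based parser skip whitespace-only values.
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
xml_parse_into_struct($parser, $xml_data, $values);

// Collect the text of every <row> element from the parsed structure.
$rows = [];
foreach ($values as $entry) {
    if ($entry['tag'] === 'row' && isset($entry['value'])) {
        $rows[] = $entry['value'];
    }
}
```

Note that the regex only touches whitespace between tags; padding inside a value such as "72" followed by spaces would still need trim().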

Upvotes: 1
