Aamir Mahmood

Reputation: 2724

PHP Remove garbage from string

I am stuck on a problem. I am using a very basic RTE (rich text editor) to get user input, and when the content is posted I trim the garbage from the string using the functions provided with the RTE. The editor is http://premiumsoftware.net/cleditor

After the user submits the data, I parse it with PHP and remove the unwanted content. Most of the users are on Linux or Mac, and they usually copy content from emails or Word documents and paste it into the RTE, which brings in a lot of garbage.

We also need to allow all UTF-8 characters from any language.

With that in mind, please look at this image:

[Image: the stored value in MySQL and its HEX output]

As you can see, no special character is visible in the color notes, and if I copy the value from MySQL and paste it anywhere, there is no garbage. But if I convert the value to HEX, a strange character shows up (highlighted in yellow).

Is there any way to filter out this kind of character? It is causing my PDF generation script to stop working.

Upvotes: 5

Views: 2117

Answers (2)

David Dierick

Reputation: 69

Since you say it breaks your PDF generation script, and since this is a fairly ordinary character (U+2028, the Line Separator), the first thing I'd check is how strict, or possibly misconfigured, your PDF script is about the character encoding(s) it can or should use.

-- edit - deceze said it in his edit -- :-)

Upvotes: 0

deceze

Reputation: 522015

That is not "garbage", it's the Line Separator character U+2028 encoded in UTF-8. It only looks like garbage if you interpret it in ASCII/Latin-1, the way everything looks like garbage when interpreted with the wrong character set. There's nothing as such to remove. If you decide that you want to remove certain superfluous characters, feel free to do so. But they're part of the original content and they're not "wrong" per se, so there's no general advice to give here.
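To make the encoding point concrete, here is a minimal sketch of what those bytes are and why they only look wrong under the wrong character set. It assumes PHP with the mbstring extension available; the variable names are made up for illustration.

```php
<?php
// U+2028 LINE SEPARATOR, written out as its three UTF-8 bytes.
$lineSeparator = "\xE2\x80\xA8";

// The hex view of the character: one code point, three bytes.
echo bin2hex($lineSeparator), PHP_EOL;                  // e280a8

// The bytes are valid UTF-8, so the stored data itself is not corrupt.
var_dump(mb_check_encoding($lineSeparator, 'UTF-8'));   // bool(true)
var_dump(mb_strlen($lineSeparator, 'UTF-8'));           // int(1)

// Misreading the same three bytes as Latin-1 turns each byte into its own
// character, which is where the "garbage" impression comes from.
$misread = mb_convert_encoding($lineSeparator, 'UTF-8', 'ISO-8859-1');
var_dump(mb_strlen($misread, 'UTF-8'));                 // int(3)
echo bin2hex($misread), PHP_EOL;                        // c3a2c280c2a8
```

If the PDF library is reading the input as a single-byte encoding, every multi-byte UTF-8 character will be split up this way, not only U+2028.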

If your PDF generator chokes on it, figure out why. Maybe it's just generally not handling Unicode correctly, in which case you need to fix that if you want to support Unicode with it. If it does have specific characters it chokes on (which would be weird), then you need to figure out what these characters are exactly and strip them.
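If it does come down to specific characters, something along these lines could help find them and then deal with just those. This is only a sketch, assuming PHP 7.2+ with mbstring and PCRE Unicode support; both function names are hypothetical, and mapping U+2028/U+2029 to a plain newline is one possible choice, not the only one.

```php
<?php
/**
 * List invisible separator/format characters in a UTF-8 string as "U+XXXX".
 * Hypothetical helper: it only reports, it does not remove anything, because
 * some format characters (for example zero-width joiners) are meaningful
 * in certain scripts.
 */
function findInvisibleCharacters(string $text): array
{
    // \p{Zl} = line separators, \p{Zp} = paragraph separators,
    // \p{Cf} = format characters. The /u flag makes PCRE treat input as UTF-8.
    if (preg_match_all('/[\p{Zl}\p{Zp}\p{Cf}]/u', $text, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return array_map(
        function (string $char): string {
            return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
        },
        $matches[0]
    );
}

/**
 * Replace U+2028 (line separator) and U+2029 (paragraph separator) with "\n",
 * leaving every other character untouched.
 */
function normalizeLineSeparators(string $text): string
{
    $result = preg_replace('/[\x{2028}\x{2029}]/u', "\n", $text);
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

$input = "Color notes\xE2\x80\xA8second line";
print_r(findInvisibleCharacters($input));          // reports U+2028
echo bin2hex(normalizeLineSeparators($input)), PHP_EOL;
```

Reporting first and replacing second keeps the rest of the user's text, in any language, exactly as submitted.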

Upvotes: 8
