Reputation: 973
I'm having trouble capturing and filtering out a strange character that came with some data. It prevents JSON data that includes it from being parsed correctly. I don't know why it gets through, since it's not in the whitelist I created with this regular expression:
$string = preg_replace('/[^\w\dñÑáéíóúÁÉÍÓÚüÜºª\-_\/\s\\<>,;:.*\[\]\(\)+?¿!&%@=]/', '', $string);
I tested the regular expression on Regexr. As you can see there, the strange character is not matched.
This is how it is displayed in the browser:
And this is how it's displayed in Pluma (a Linux editor):
When I copy it and try to paste it into Google, for instance, nothing is inserted. Really strange. I've never run into a situation like this.
Any idea on how to handle it?
Upvotes: 0
Views: 340
Reputation: 89557
This mysterious character isn't so mysterious, but it is difficult to display in your editor because it is the page-break control character: Form Feed (\x0C, see the ASCII table).
This character is included in the \s character class. Since \s is part of your whitelist, the negated class never matches the form feed, so it is never removed.
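To check this behaviour yourself, here is a minimal sketch. The sample string and variable names are made up for illustration, and the pattern is a reduced stand-in for the whitelist in the question, not the original.

```php
<?php
// Hypothetical sample: "page", a form feed (0x0C), then "break".
$sample = "page\x0Cbreak";

// \s matches the form feed on its own.
$matchesWhitespace = preg_match('/\s/', "\x0C");

// A negated class that lists \s therefore leaves the form feed in place.
$kept = preg_replace('/[^\w\s]/', '', $sample);

// Dumping the bytes makes the invisible character visible as "0c".
$hex = bin2hex($kept);
echo $hex, PHP_EOL;
```

Dumping the bytes with bin2hex is often the quickest way to identify a character that editors and browsers refuse to render.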
One solution is to remove \s from your pattern and replace it with an explicit list of the whitespace characters you want to allow.
To make things easier, you can use the \h class (if supported), which matches all horizontal whitespace, and then add the vertical whitespace characters you want by hand.
Note that if you are working with the Windows-1252 code page, keep its character table at hand so that you don't forget anything and can shorten the pattern using character ranges.
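As a rough sketch of that idea, the pattern below swaps \s for \h plus \r and \n. It is a shortened, ASCII-only whitelist chosen for illustration (an assumption, not the full list from the question), and the input string is invented.

```php
<?php
// Whitelist: word characters, horizontal whitespace (\h), CR, LF and a few
// punctuation marks. \s is deliberately absent, so the form feed is not allowed.
$pattern = '/[^\w\h\r\n.,;:]/';

// Hypothetical input containing a form feed before the line break.
$dirty = "line one\x0C\r\nline\ttwo";

// Everything outside the whitelist, including \x0C, is removed.
$clean = preg_replace($pattern, '', $dirty);
echo bin2hex($clean), PHP_EOL;
```

The tab and the CR/LF pair survive because they are listed explicitly or covered by \h, while the form feed is stripped.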
Upvotes: 2
Reputation: 973
I got it!
It turns out this character is a form feed.
It's included in \s, along with the space, \t, \r, and \n.
By being more specific, I achieved what I wanted. The new regex:
/[^\w\dñÑáéíóúÁÉÍÓÚüÜºª\-_\/ \r\n\\<>,;:.*\[\]\(\)+?¿!&%@=]/
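For anyone who wants to see the effect on JSON parsing, here is a small sketch. It assumes the JSON is assembled by hand around the filtered value (the thread doesn't say how the JSON is actually produced), and it uses a reduced, ASCII-only version of the whitelist above to sidestep encoding issues.

```php
<?php
// Reduced ASCII-only stand-in for the whitelist above (an assumption).
$pattern = '/[^\w\-\/ \r\n<>,;:.*\[\]()+?!&%@=]/';

// Hypothetical value containing a raw form feed.
$raw = "first page\x0Csecond page";

// A raw control character inside a JSON string is invalid, so decoding fails.
$broken = json_decode('{"text":"' . $raw . '"}', true);

// After filtering, the same hand-built JSON decodes normally.
$clean = preg_replace($pattern, '', $raw);
$fixed = json_decode('{"text":"' . $clean . '"}', true);

var_dump($broken, $fixed);
```

If the JSON were generated with json_encode instead, the form feed would be escaped as \f and would not break parsing, so the filter matters mainly for hand-built payloads.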
Upvotes: 1