Luis Martin

Reputation: 973

I can't capture a strange character with regular expressions

I'm having trouble capturing and filtering out a strange character that arrived with some data. Any JSON that includes it fails to parse correctly. I don't understand why it survives, since it's not included in the whitelist I created with this regular expression:

$string = preg_replace('/[^\w\dñÑáéíóúÁÉÍÓÚüܺª\-_\/\s\\<>,;:.*\[\]\(\)+?¿!&%@=]/', '', $string);

I tested the regular expression on Regexr. As you can see, the strange character is not captured:

[image: testing the regular expression on Regexr]

This is how it is displayed in the browser:

[image: the character as displayed in the browser]

And this is how it's displayed in Pluma (a Linux editor):

[image: the character as displayed in Pluma]

When I copy it and try to paste it into Google, for instance, nothing is inserted. Really strange. I've never run into a situation like this.

Any idea on how to handle it?

Upvotes: 0

Views: 340

Answers (2)

Casimir et Hippolyte

Reputation: 89557

This mysterious character isn't so mysterious. It is just difficult to display in your editor because it is the page-breaking control character: form feed (\x0C in the ASCII table).

This character is part of the \s character class. Since \s is in your whitelist, the negated class never matches the form feed, so it is not removed.
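To see this for yourself, here is a minimal sketch. The sample string and the shortened ASCII-only whitelist are my own assumptions for illustration; they are not the exact data or pattern from the question.

```php
<?php
// Hypothetical sample: a form feed (\x0C) embedded in ordinary text.
$string = "page one\x0Cpage two";

// \s covers space, \t, \n, \r and also \f (form feed).
$matchesWhitespace = preg_match('/\s/', "\x0C");
var_dump($matchesWhitespace); // int(1)

// Simplified stand-in for the whitelist in the question. Because \s is
// allowed, the negated class never matches the form feed.
$filtered = preg_replace('/[^\w\s.,;:-]/', '', $string);

// The form feed is still there after filtering.
var_dump(strpos($filtered, "\x0C") !== false); // bool(true)
```

The same check with bin2hex($filtered) makes the invisible byte show up as 0c.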

One solution is to remove \s from your pattern and replace it with an explicit list of the whitespace characters you want to allow.

To make things easier, you can start with the \h class (if supported), which contains all horizontal whitespace characters, and then add the vertical whitespace characters you want by hand.
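A rough sketch of that idea, assuming your PCRE build supports \h. The input string and the short ASCII-only whitelist are hypothetical, so adapt the rest of the class to your real pattern.

```php
<?php
// Hypothetical input containing a tab, a newline and a form feed.
$string = "a\tb\nc\x0Cd";

// \h allows horizontal whitespace (space, tab, ...); \r and \n are the
// only vertical whitespace characters added by hand. The form feed is
// not listed, so the negated class matches it and it gets removed.
$clean = preg_replace('/[^\w\h\r\n.,;:-]/', '', $string);

var_dump($clean === "a\tb\ncd"); // bool(true)
```

The tab and the newline are kept, and only the form feed is dropped.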

Note that if you are working with the Windows-1252 code page, keep its character table at hand so that you don't forget anything, and use character ranges to shorten the pattern.

Upvotes: 2

Luis Martin

Reputation: 973

I got it!

It turns out this character is a form feed.

It's included in \s, just like the space, \t, \r and \n.

By being more specific, I achieved what I wanted. The new regex:

/[^\w\dñÑáéíóúÁÉÍÓÚüܺª\-_\/ \r\n\\<>,;:.*\[\]\(\)+?¿!&%@=]/
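For anyone landing here later, this is roughly how the effect on JSON can be checked. It is only a sketch: the value, the hand-built JSON and the shortened ASCII-only whitelist are made up for the example, and the real pattern is the one above.

```php
<?php
// Hypothetical value that picked up a form feed (\x0C) somewhere upstream.
$value = "first page\x0Csecond page";

// JSON built by hand with the raw control character inside a string
// is invalid, so json_decode() returns null.
$bad = '{"text":"' . $value . '"}';
$badResult = json_decode($bad, true);
var_dump($badResult); // NULL

// ASCII-only stand-in for the new whitelist: a literal space, \r and \n
// instead of \s, so the form feed is no longer allowed.
$clean = preg_replace('/[^\w \r\n.,;:-]/', '', $value);

// With the form feed stripped, the same JSON decodes normally.
$good = '{"text":"' . $clean . '"}';
$goodResult = json_decode($good, true);
var_dump($goodResult['text'] === "first pagesecond page"); // bool(true)
```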

Upvotes: 1
