clarkk

Reputation: 27725

regexp - odd whitespaces in string

I'm running regexps on some strings, and my pattern matches whitespace with \s

But some strings contain strange spaces. Converted to hex, they are a0

How can I convert all the strange spaces to a normal space, so that a regexp can detect them both with a literal space and with \s?

When the string is presented as UTF8 all a0 chars are represented as a

input in HEX

a03535a03832a03834a03135a02da053452e6e723aa0444ba03132a03638a03336a03933

input as string

 55 82 84 15 - SE.nr: DK 12 68 36 93

Upvotes: 3

Views: 75

Answers (2)

Wiktor Stribiżew

Reputation: 627082

You do not need to add the non-breaking space to the [\s] character class; \s can match any Unicode whitespace if you use the /u modifier:

'/\s/u'

See the regex demo

From pcre.org:

The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and space (32)... If PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types: \s any character that matches \p{Z} or \h or \v

The PCRE_UCP option and Unicode semantics are enabled with the /u modifier.
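A minimal PHP sketch of this approach, using the hex dump from the question. One caveat: in that dump every a0 is a single byte, which is not valid UTF-8 (a UTF-8 non-breaking space is the two bytes c2 a0), so a /u pattern fails on the raw bytes. The sketch therefore assumes the input is ISO-8859-1 and converts it with iconv() first. The source encoding and the variable names are assumptions, not something the question states.

```php
<?php
// Hex dump copied from the question; each a0 is one byte.
$hex = 'a03535a03832a03834a03135a02da053452e6e723aa0444ba03132a03638a03336a03933';
$raw = hex2bin($hex);

// A lone 0xA0 byte is invalid UTF-8, so matching with /u reports an error:
// preg_match() returns false and preg_last_error() is PREG_BAD_UTF8_ERROR.
$onRawBytes = preg_match('/\s/u', $raw);
$rawError   = preg_last_error();

// Assumption: the bytes are ISO-8859-1. Convert them to UTF-8 so that
// 0xA0 becomes U+00A0 (bytes c2 a0).
$utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);

// With /u, \s follows Unicode properties and matches U+00A0 as well.
$normalized = preg_replace('/\s/u', ' ', $utf8);

echo $normalized, PHP_EOL; // " 55 82 84 15 - SE.nr: DK 12 68 36 93"
```

If the string is already valid UTF-8, the iconv() step is unnecessary and the preg_replace() call alone should be enough.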

Upvotes: 3

hsz

Reputation: 152266

a0 is the hex representation of &nbsp;, the non-breaking space.

You can match it with:

[\s\xA0]
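As a rough PHP sketch, this class can be used to replace the odd spaces in the question's input with normal ones. It assumes the string is handled byte by byte (no /u modifier), which fits the hex dump in the question, where each a0 is a single byte. The variable names are only illustrative.

```php
<?php
// Hex dump copied from the question; each a0 is one byte.
$raw = hex2bin('a03535a03832a03834a03135a02da053452e6e723aa0444ba03132a03638a03336a03933');

// Without /u the subject is matched one byte at a time,
// so \xA0 matches the single 0xA0 byte directly.
$normalized = preg_replace('/[\s\xA0]/', ' ', $raw);

echo $normalized, PHP_EOL; // " 55 82 84 15 - SE.nr: DK 12 68 36 93"
```

Note that on a UTF-8 string the non-breaking space is the two bytes c2 a0, so this byte-wise pattern would replace only the a0 and leave a stray c2 behind. For UTF-8 input the pattern would need the /u modifier.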

Upvotes: 4
