Regular Expression doesn't work out

Question

I have the following string, extracted from the source code:

Plange Mühle 1, 40221 Düsseldorf

I want to capture everything up to the comma, then the number, then the word after it. I wrote this regex, which doesn't work:

^(.*?),\s*([0-9]{5})\s*(\S*)

But this should normally work, right? So I deleted everything and rebuilt it part by part. With only the part that takes everything up to the comma, it worked, but of course only for the data before the comma. After the first whitespace, the problems occurred: whenever I added anything after it, such as [^0-9], a sign appeared. This one right here: �

array(2) { [0]=> string(19) "Plange Mühle 1,�" [1]=> string(17) "Plange Mühle 1" }

What is this question mark, and where does it come from? My documents are all UTF-8 encoded, so I don't understand how it got there. Do you have any ideas what it is and how I can solve this?

Tim Pietzcker · Accepted Answer

The regex looks OK, but are you using it correctly? With Unicode strings, you should be setting the /u modifier. Now this is just speculation, but perhaps there's a non-ASCII whitespace character somewhere in the string that your regex doesn't recognize correctly, cutting the multi-byte sequence in half and giving you an invalid character as a result.

So perhaps this works:

preg_match('/^(.*?),\s*([0-9]{5})\s*(\S*)/u', $subject, $regs);
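
To check the speculation above, here is a minimal sketch that dumps the bytes after the comma and then runs the /u version of the pattern. It assumes the hidden character is a no-break space (U+00A0, bytes C2 A0 in UTF-8); the real input is not shown in the question, so the $subject below is a reconstruction, and the variable names $afterComma, $street, $zip and $city are only illustrative.

```php
<?php
// Assumption: the character after the comma is a no-break space (U+00A0),
// i.e. the two bytes C2 A0 in UTF-8. This is a reconstructed test string,
// not the asker's actual data. The umlauts are written as byte escapes so
// the example does not depend on the encoding of this file.
$subject = "Plange M\xC3\xBChle 1,\xC2\xA040221 D\xC3\xBCsseldorf";

// Look at the two bytes right after the comma to see what is really there.
$afterComma = substr($subject, strpos($subject, ',') + 1, 2);
echo bin2hex($afterComma), "\n";

// With the /u modifier, pattern and subject are treated as UTF-8,
// so the regex works on whole characters instead of single bytes.
$matched = preg_match('/^(.*?),\s*([0-9]{5})\s*(\S*)/u', $subject, $regs);

if ($matched === 1) {
    // $regs[0] is the full match; groups 1 to 3 are the captured parts.
    list(, $street, $zip, $city) = $regs;
    echo $street, ' | ', $zip, ' | ', $city, "\n";
} else {
    echo "no match\n";
}
```

If the hex dump shows something other than a plain space (20), the input contains a character that a byte-oriented pattern can split in the middle, which would explain the � in the output.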
