regexp to break text into array by spaces, punctuation marks, linebreaks

Question

I need to split text into an array by spaces, punctuation marks, and line breaks. Here is the sample text:

A man’s jacket is of green color. He – the biggest star in modern history – rides bikes very fast (230 km per hour). How is it possible?! What kind of bike is he using? The semi-automatic gear of his bike, which is quite expensive, significantly helps to reach that speed. Some (or maybe many) claim that he is the fastest in the world! “I saw him ride the bike!” Mr. John Deer speaks. “The speed he sets is 133.78 kilometers per hour,” which sounds incredible; sounds deceiving.

I've already got the regex that does that:

preg_split('/(?<=\s)|(?<=\w)(?=[.,:;!?()-])|(?<=[.,!()?\x{201C}])(?=[^ ])/u', $text);

But currently it splits the hyphenated word semi-automatic into two words, while it has to remain one. If there are spaces on either side of the dash, as in semi - automatic, then this should be three words. I don't quite understand how this regexp works, so any help is appreciated.

The second problem is that if the text contains line breaks, it catches them but also creates a redundant element. See the example: elements [8] and [9]. Element [8] is redundant. How can I work around it?

gwillie · Accepted Answer

I haven't tested the following.

First, let's change the regex:

/[.,:;!?()\s]|(?<=\s)-(?=\s)/u

Explained:

[.,:;!?()\s] - split on punctuation or whitespace

|(?<=\s)-(?=\s) - (alternative) split on a - that has whitespace on both sides

Next, run array_filter() on the result to remove the empty (falsy) elements
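A minimal sketch of those two steps, assuming PHP 7.4+ for the arrow function; the sample sentence and variable names are made up for illustration. One caveat: array_filter() without a callback also drops the string "0", so the sketch filters on the empty string explicitly.

```php
<?php
// First pattern from the answer: punctuation and whitespace are consumed as delimiters.
$pattern = '/[.,:;!?()\s]|(?<=\s)-(?=\s)/u';

// Hypothetical sample text, not the asker's full paragraph.
$text = "A semi-automatic gear, quite fast. Or semi - automatic?";

$parts = preg_split($pattern, $text);

// Adjacent delimiters (", " or " - ") leave empty strings behind; remove them
// and reindex so the keys run 0..n-1 again.
$words = array_values(array_filter($parts, fn($p) => $p !== ''));

print_r($words);
// A, semi-automatic, gear, quite, fast, Or, semi, automatic
```

Note that this version throws the punctuation and the spaced dash away; only the words remain.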

EDIT:

To keep punctuation use:

/(?=[.,:;!?()\s])|(?<=\s)-(?=\s)/u

I just wrapped the character class in a lookahead, so the characters are no longer consumed
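A quick way to see what the lookahead version returns, as a sketch on a made-up string. Because \s sits inside the lookahead as well, whitespace is not consumed either and stays attached to the piece that follows it, so a trim is still needed at this stage.

```php
<?php
// Lookahead version from the answer: split *before* punctuation or whitespace.
$pattern = '/(?=[.,:;!?()\s])|(?<=\s)-(?=\s)/u';

// Hypothetical sample string.
$parts = preg_split($pattern, 'Hi, you');
print_r($parts);   // "Hi", ",", " you"  (note the leading space on the last piece)

// Trimming each piece removes the whitespace that the lookahead left in place.
$trimmed = array_map('trim', $parts);
print_r($trimmed); // "Hi", ",", "you"
```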

EDIT 2:

/\s|(?=[.,:;!?)])|(?<=\s[("])|(?<=\s)-(?=\s)/u

EDIT 3:

\s|(?<=\s)-(?=\s)|(?<=\w)(?=[.,:;!?])|(?<=[.,"!()?\x{201C}])(?=[^ ])

EDIT 4:

\s|(?<=\s)-(?=\s)|(?<=\w)(?=[.,:;!?)])|(?<=[.,"!()?\x{201C}])(?=[^ ])

EXPLAINED:

Oh my my, my head wasn't in the game today. Your regex was nearly there; just a mod or two was needed, so here is the final regex.

/\s|(?<=\w)(?=[.,:;!?)])|(?<=[.,"!()?\x{201C}])/u

Note: lookarounds only assert that something is present; they consume zero characters, hence the term 'zero-width assertion' you may come across. If we didn't use lookarounds, the regex engine would consume the matched character as part of the delimiter and it would be removed from the results. The pipe metacharacter | is an OR; in regex terms, an alternation.
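The difference between consuming a character and only asserting it can be seen in a small sketch (the string 'fast,slow' is just an illustration):

```php
<?php
$text = 'fast,slow';

// Consuming match: the comma itself is the delimiter, so it disappears.
$consumed = preg_split('/,/', $text);
print_r($consumed); // "fast", "slow"

// Lookahead only: split at the empty position before the comma; the comma survives.
$kept = preg_split('/(?=,)/', $text);
print_r($kept);     // "fast", ",slow"

// Lookbehind + lookahead, alternated with a lookbehind for the comma:
// split on both sides, so the comma becomes its own element.
$isolated = preg_split('/(?<=\w)(?=,)|(?<=,)/', $text);
print_r($isolated); // "fast", ",", "slow"
```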

\s - match a white space character. We don't need this in a lookaround as we want to remove it anyway.

(?<=\w)(?=[.,:;!?)]) - OR match the empty position that has a word character \w behind it (positive lookbehind) and one of the punctuation characters .,:;!?) ahead of it (positive lookahead).

(?<=[.,"!()?\x{201C}]) - OR match the empty position right after one of the punctuation characters .,"!()? or \x{201C} (positive lookbehind). The \x{201C} is the left double quotation mark “ (a Unicode character, three bytes in UTF-8).

u - modifier that makes the pattern and the subject be treated as UTF-8, which is needed for characters like \x{201C}

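In your original regex the (?=[^ ]) at the end is redundant here, so I removed it: any whitespace after the punctuation is split off by \s anyway, and PREG_SPLIT_NO_EMPTY drops the empty elements. It could have been written (?!\s), which is nearly the same, a negative lookahead for a single white space character. The difference is that (?!\s) also rejects tabs and line breaks, and it succeeds at the end of the string.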
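The two lookaheads agree before a space or a letter. Where they part ways is at the end of the string and before a line break, which a small check makes visible (a sketch; the test strings are made up):

```php
<?php
// (?=[^ ]) needs a following character that is not a space.
// (?!\s)   needs the next character not to be whitespace, or no next character at all.
$positive = '/\.(?=[^ ])/';
$negative = '/\.(?!\s)/';

// Before a letter and before a space they agree.
$letter = [preg_match($positive, 'end.x'), preg_match($negative, 'end.x')];   // 1, 1
$space  = [preg_match($positive, 'end. x'), preg_match($negative, 'end. x')]; // 0, 0

// At the end of the string only (?!\s) succeeds.
$atEnd = [preg_match($positive, 'end.'), preg_match($negative, 'end.')];      // 0, 1

// Before a line break only (?=[^ ]) succeeds, because "\n" is not a space character.
$newline = [preg_match($positive, "end.\nnext"), preg_match($negative, "end.\nnext")]; // 1, 0

print_r([$letter, $space, $atEnd, $newline]);
```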

So you'd use preg_split() like:

$return = preg_split('/\s|(?<=\w)(?=[.,:;!?)])|(?<=[.,"!()?\x{201C}])/u', $text, -1, PREG_SPLIT_NO_EMPTY);
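As a sketch of how that call behaves on the two cases from the question, here it is run on a short made-up sample containing a hyphenated word, a spaced dash, and a Windows-style line break (the variable names are illustrative):

```php
<?php
// Final pattern from the answer; single quotes keep \s, \w and \x{201C} intact for PCRE.
$pattern = '/\s|(?<=\w)(?=[.,:;!?)])|(?<=[.,"!()?\x{201C}])/u';

// Hypothetical sample: hyphenated word, dash with spaces, and a "\r\n" line break.
$text = "The semi-automatic gear (or semi - automatic) helps.\r\nHow is it possible?!";

// PREG_SPLIT_NO_EMPTY drops the empty pieces that would otherwise appear
// between "\r" and "\n" and after the final "!".
$tokens = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($tokens);
// The, semi-automatic, gear, (, or, semi, -, automatic, ), helps, ., How, is, it, possible, ?, !
```

The hyphenated word stays whole because - is not in any of the character classes, and the spaced dash comes out as its own element because the whitespace on both sides of it is split away.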
