kiki

Reputation: 25

Regex alphanumeric, punctuation as 1 word

I am matching each word against the identical words in a paragraph.

Update 1: I realise that just accepting the punctuation I need does not solve this issue.

For example, 'hello-' and 'hello' are considered separate words.

Is there a way to remove punctuation before and after a word, as well as standalone punctuation? Punctuation should only be allowed within a word.

$string="_ - – hello’ hello' hello, hello- world. he,llo hello-world hello_world hel-lo-world hello9world"; 

The output should be

hello hello hello hello world he,llo hello-world hello_world hel-lo-world hello9world

Only words, or punctuation within a word, should remain.

Update 2: If only words or punctuation within words are kept, decimal numbers have an issue.

1.0 is still fine, but .1 becomes 1 instead of 0.1, because punctuation before and after a word is removed.

Update 3: When accepting punctuation within a word, substrings that start or end with a letter or a number have an issue: 20-year-old becomes '20-' and 'year-old'.

Thanks mickmackusa.

Upvotes: 1

Views: 451

Answers (1)

mickmackusa

Reputation: 47874

Pattern: /[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu (Pattern Demo)

This pattern requires every matched substring to start and end with a letter or a number. A substring may contain a punctuation character (any of those in the character class [-_’',.]), but each one must be immediately followed by one or more letters or numbers. The * means zero or more repetitions of the preceding parenthetical group, so a substring is valid whether or not it contains a non-alphanumeric character.
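As a minimal sketch of how the pattern behaves on leading, trailing, and inner punctuation, the snippet below wraps it in a helper. The function name extractWords is hypothetical and only used for illustration; the pattern itself is the one given above, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns every substring matched by the answer's pattern.
function extractWords(string $text): array
{
    // Alphanumeric run, then zero or more (one punctuation char + alphanumeric run).
    preg_match_all("/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu", $text, $matches);
    return $matches[0] ?? [];
}

$words = extractWords("hello, hello- world. he,llo hel-lo-world");
echo implode(' ', $words), PHP_EOL; // hello hello world he,llo hel-lo-world
```

Trailing punctuation is dropped because nothing alphanumeric follows it, while inner punctuation is kept because it sits between two alphanumeric runs.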

This pattern will not match a substring containing two consecutive non-alphanumeric characters as a single match. For example, 20--what is not returned as 20--what; it is returned as 20 and what.
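The consecutive-punctuation case can be checked with a short sketch like the following. The variable names are illustrative; the input contrasts a double hyphen with a single one.

```php
<?php
$pattern = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu";

// "20--what" has two hyphens in a row, "20-what" has only one.
preg_match_all($pattern, "20--what 20-what", $out);
$tokens = $out[0];

echo implode(' | ', $tokens), PHP_EOL; // 20 | what | 20-what
```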

*If you want to allow any single non-whitespace character between the alphanumeric parts of a substring, you can use this: /[a-z\d]+(?:\S[a-z\d]+)*/iu
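To compare the two patterns side by side, here is a small sketch; the sample text and the variable names $strict and $loose are assumptions made for illustration.

```php
<?php
$strict = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu"; // only listed punctuation allowed inside
$loose  = "/[a-z\d]+(?:\S[a-z\d]+)*/iu";        // any one non-whitespace char allowed inside

$text = "user@example hello#world 20--what";

preg_match_all($strict, $text, $s);
preg_match_all($loose, $text, $l);
$strictTokens = $s[0];
$looseTokens  = $l[0];

echo implode(' | ', $strictTokens), PHP_EOL; // user | example | hello | world | 20 | what
echo implode(' | ', $looseTokens), PHP_EOL;  // user@example | hello#world | 20 | what
```

Note that the \S version still splits 20--what, because \S consumes only one character and must be followed directly by a letter or a number.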

The i flag allows [a-z] to match uppercase letters as well.
The u flag enables Unicode (UTF-8) matching, so multibyte characters like ’ are treated as single characters.
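The effect of each flag can be seen by dropping one at a time, as in this rough sketch. It assumes the source file and the input are UTF-8 encoded; the variable names are illustrative.

```php
<?php
$text = "HEL’LO Hello-World";

$both = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu";
$noI  = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/u"; // case-sensitive
$noU  = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/i"; // byte mode: ’ is seen as three separate bytes

preg_match_all($both, $text, $m1);
preg_match_all($noI, $text, $m2);
preg_match_all($noU, $text, $m3);
$withBoth = $m1[0];
$withoutI = $m2[0];
$withoutU = $m3[0];

echo implode(' | ', $withBoth), PHP_EOL; // HEL’LO | Hello-World
echo implode(' | ', $withoutI), PHP_EOL; // ello | orld
echo implode(' | ', $withoutU), PHP_EOL; // HEL | LO | Hello-World
```

Without u, the multibyte ’ cannot act as a single joining character, so hel’lo-style words are split in two.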

PHP Code: (Demo)

$string="_ - – hel’lo’ hel'lo' .1 1.0 1. hello, hello- world. he,llo hello-world hello_world hel-lo-world hello9world -20 20- 20-year -20year- -20-year- 20-year-old 20-yearold 20year-old 20-year-old-old 20-20-year-20-old-";
echo preg_match_all("/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu",$string,$out)?implode(' ',$out[0]):'fail';

Output:

hel’lo hel'lo 1 1.0 1 hello hello world he,llo hello-world hello_world hel-lo-world hello9world 20 20 20-year 20year 20-year 20-year-old 20-yearold 20year-old 20-year-old-old 20-20-year-20-old

Upvotes: 1
