Reputation: 25
I am matching each word against the words of an identical paragraph.
Update 1: I realise that simply accepting the punctuation I need does not solve this issue.
For example, 'hello-' and 'hello' are considered separate words.
Is there a way to remove punctuation before and after a word, as well as stand-alone punctuation, and only allow punctuation within a word?
$string="_ - – hello’ hello' hello, hello- world. he,llo hello-world hello_world hel-lo-world hello9world";
The output should be
hello hello hello hello world he,llo hello-world hello_world hel-lo-world hello9world
Only words, or punctuation within a word, should remain.
Update 2: If only words and punctuation within words are kept, decimal numbers become a problem.
1.0 is still fine, but because punctuation before and after a word is removed, .1 becomes 1 instead of 0.1.
Update 3: When punctuation within a word is accepted, substrings that start or end with a letter or a number cause an issue: 20-year-old becomes '20-' and 'year-old'.
Thanks mickmackusa.
Upvotes: 1
Views: 451
Reputation: 47874
Pattern: /[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu
(Pattern Demo)
This pattern demands that all matching substrings start with a letter or a number. The substrings may contain a punctuation character (any of the ones in the character class [-_’',.]) but it must be immediately followed by one or more letters or numbers. The * means zero or more of the preceding parenthetical expression, so substrings can be valid whether they contain a non-alphanumeric character or not.
This pattern will not match a substring with two consecutive non-alphanumeric characters as one match. For example, 20--what will not be returned as 20--what; it will be split into 20 and what.
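As a quick check of the behaviour described above, here is a minimal sketch using the same pattern. The input string and variable names are made up for illustration.

```php
<?php
// Same pattern as in the answer; $input is a hypothetical test string.
$pattern = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu";
$input = "20--what 20-what";

preg_match_all($pattern, $input, $matches);

// A doubled hyphen splits the substring in two; a single hyphen keeps it whole.
echo implode(' | ', $matches[0]);
```

With a single hyphen the punctuation is immediately followed by a letter, so the whole substring is one match. With two hyphens that condition fails and the match ends before the first hyphen.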
*If you want to allow ANY single non-whitespace character in the middle of the string, you can use this:
/[a-z\d]+(?:\S[a-z\d]+)*/iu
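A small sketch of how this looser variant behaves, assuming a made-up input that uses characters outside the original character class:

```php
<?php
// Looser variant: any single non-whitespace character may join two alphanumeric runs.
$pattern = "/[a-z\d]+(?:\S[a-z\d]+)*/iu";
$input = "hello@world foo#bar! a++b";   // hypothetical sample text

preg_match_all($pattern, $input, $matches);

// '@' and '#' sit between alphanumeric runs, so they are kept.
// The trailing '!' and the doubled '+' are not, because nothing alphanumeric follows them directly.
echo implode(' | ', $matches[0]);
```

As with the stricter pattern, only one joining character is allowed at a time, so a++b is still split into a and b.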
The i flag allows [a-z] to match uppercase occurrences as well.
The u flag treats the pattern and the subject as UTF-8, which allows multibyte unicode characters like ’ to work inside the character class.
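To see what each flag contributes, the following sketch runs the same pattern with one flag removed at a time. It assumes the script file is saved as UTF-8; the input string and variable names are hypothetical.

```php
<?php
// Hypothetical input chosen to exercise both flags: uppercase letters and a curly apostrophe.
$input = "Hel’lo WORLD";

$withBoth = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu";
$withoutI = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/u";
$withoutU = "/[a-z\d]+(?:[-_’',.][a-z\d]+)*/i";

preg_match_all($withBoth, $input, $both);
preg_match_all($withoutI, $input, $noI);
preg_match_all($withoutU, $input, $noU);

// Without i, uppercase letters are skipped.
// Without u, ’ is handled as three separate bytes, so it no longer joins the two halves.
echo implode(' | ', $both[0]), "\n";
echo implode(' | ', $noI[0]), "\n";
echo implode(' | ', $noU[0]), "\n";
```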
PHP Code: (Demo)
$string="_ - – hel’lo’ hel'lo' .1 1.0 1. hello, hello- world. he,llo hello-world hello_world hel-lo-world hello9world -20 20- 20-year -20year- -20-year- 20-year-old 20-yearold 20year-old 20-year-old-old 20-20-year-20-old-";
echo preg_match_all("/[a-z\d]+(?:[-_’',.][a-z\d]+)*/iu",$string,$out)?implode(' ',$out[0]):'fail';
Output:
hel’lo hel'lo 1 1.0 1 hello hello world he,llo hello-world hello_world hel-lo-world hello9world 20 20 20-year 20year 20-year 20-year-old 20-yearold 20year-old 20-year-old-old 20-20-year-20-old
Upvotes: 1