Reputation: 166429
I'm trying to write a regular expression in PHP that removes alphanumeric words (words that contain digits), but not numbers that come with punctuation or similar special characters (e.g. prices, phone numbers, etc.).
Words which should be removed:
1st, H20, 2nd, O2, 3rd, NUMB3RS, Rüthen1, Wrocław2
Words which shouldn't be removed:
0, 5.5, 10, $100, £65, +44, (20), 123, ext:124, 4.4-BSD
Here is the code so far:
$text = 'To remove: 1st H20; 2nd O2; 3rd NUMB3RS; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD';
$pattern = '/\b\w*\d\w*\b-?/';
echo $text, preg_replace($pattern, " ", $text);
However, it removes every word that contains a digit, including the plain numbers, prices and phone numbers.
So far I've also tried the following patterns:
/(\\s+\\w{1,2}(?=\\W+))|(\\s+[a-zA-Z0-9_-]+\\d+)/ # Removes digits, etc.
/[^(\w|\d|\'|\"|\.|\!|\?|;|,|\\|\/|\-|:|\&|@)]+/ # Doesn't work.
/(\\s+\\w{1,2}(?=\\W+))|(\\s+[a-zA-Z0-9_-]+\\d+)/ # Removes too much.
/[^\p{L}\p{N}-]+/u # It removes only special characters.
/(^[\D]+\s|\s[\D]+\s|\s[\D]+$|^[\D]+$)+/ # Removes words.
/ ?\b[^ ]*[0-9][^ ]*\b/i # Almost, but removes digits, price, phone.
/\s+[\w-]*\d[\w-]*|[\w-]*\d[\w-]*\s*/ # Almost, but removes digits, price, phone.
/\b\w*\d\w*\b-?/ # Almost, but removes digits, price, phone.
/[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*/ # Almost, but removes too much.
I found these across SO (most of them are too specific) and on other sites. They are supposed to remove words with digits, but they don't do what I need.
How can I write a simple regular expression that removes these words without touching anything else?
Sample text:
To remove: 1st H20; 2nd O2; 3rd NUMB3RS; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD
Expected output:
To remove: ; ; ; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD
Upvotes: 0
Views: 287
Reputation: 5510
How about replacing \b(?=[a-z]+\d|[a-z]*\d+[a-z]+)\w*\b\s* with nothing?
Demo: https://regex101.com/r/jA2fW3/1
Pattern code:
$pattern = '/\b(?=[a-z]+\d|[a-z]*\d+[a-z]+)\w*\b\s*/i';
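To match alphanumeric words containing foreign/accented letters, use the following pattern (the u modifier is needed so that PCRE reads the subject as UTF-8 characters rather than single bytes):
$pattern = '/\b(?=[\pL]+\d|[\pL]*\d+[\pL]+)[\pL\w]*\b\s*/iu';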
Demo: https://regex101.com/r/jA2fW3/3
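As a rough check of the lookahead approach, here is a minimal sketch that runs both patterns through preg_replace. The function name removeMixedWords and its $unicode flag are hypothetical, introduced only for illustration. The "u" modifier on the Unicode variant is an assumption that the input is UTF-8.

```php
<?php
// Hypothetical helper, not part of the answer: wraps the lookahead patterns.
function removeMixedWords(string $text, bool $unicode = false): string
{
    // The lookahead requires at least one letter AND one digit in the word,
    // so plain numbers such as "10" or "123" are never matched.
    // Assumption: the Unicode variant uses the "u" modifier (UTF-8 input).
    $pattern = $unicode
        ? '/\b(?=\pL+\d|\pL*\d+\pL+)[\pL\w]*\b\s*/iu'
        : '/\b(?=[a-z]+\d|[a-z]*\d+[a-z]+)\w*\b\s*/i';

    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '', $text) ?? $text;
}

$text = 'To remove: 1st H20; 2nd O2; 3rd NUMB3RS; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD';

echo removeMixedWords($text), PHP_EOL;
// To remove: ; ; ; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD

echo removeMixedWords('Rüthen1 Wrocław2 stay', true), PHP_EOL;
// stay
```

The trailing \s* in the pattern also consumes the whitespace after each removed word, which is why the result has no doubled spaces.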
Upvotes: 4
Reputation: 70732
You can modify your regular expression as follows for the desired output.
$text = preg_replace('/\b(?:[a-z]+\d+[a-z]*|\d+[a-z]+)\b/i', '', $text);
To match any kind of letter from any language, use the Unicode property \p{L}:
$text = preg_replace('/\b(?:\pL+\d+\pL*|\d+\pL+)\b/u', '', $text);
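A minimal sketch of how this alternation pattern behaves on the question's sample text. The helper name stripMixedWords is hypothetical, and the second preg_replace that collapses whitespace is an added assumption rather than part of the answer: the pattern itself removes only the word, so the spaces around it stay behind.

```php
<?php
// Hypothetical helper, not part of the answer.
function stripMixedWords(string $text): string
{
    // Letters-digits(-letters) or digits-letters, bounded on both sides.
    $stripped = preg_replace('/\b(?:\pL+\d+\pL*|\d+\pL+)\b/u', '', $text) ?? $text;

    // Assumption: collapse the runs of whitespace left by removed words.
    return preg_replace('/\s+/u', ' ', $stripped) ?? $stripped;
}

$text = 'To remove: 1st H20; 2nd O2; 3rd NUMB3RS; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD';

echo stripMixedWords($text), PHP_EOL;
// To remove: ; ; ; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD

echo stripMixedWords('Visit Rüthen1 and Wrocław2 today'), PHP_EOL;
// Visit and today
```

Note that this pattern only covers the shapes letters-digits-letters and digits-letters, so a word that alternates more than once, such as a1b2, would be left in place.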
Upvotes: 3