kenorb

Reputation: 166429

How to remove all alphanumeric words from the text?

I'm trying to write a regular expression in PHP that simply removes alphanumeric words (words that mix letters and digits), but not numbers that come with punctuation or similar special characters (e.g. prices, phone numbers, etc.).

Words which should be removed:

1st, H20, 2nd, O2, 3rd, NUMB3RS, Rüthen1, Wrocław2

Words which shouldn't be removed:

0, 5.5, 10, $100, £65, +44, (20), 123, ext:124, 4.4-BSD

Here is the code so far:

$text = 'To remove: 1st H20; 2nd O2; 3rd NUMB3RS; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD';
$pattern = '/\b\w*\d\w*\b-?/';
echo $text, preg_replace($pattern, " ", $text);

However, it removes every token that contains a digit, including the plain numbers, prices and phone numbers.

I've also tried the following patterns:

/(\\s+\\w{1,2}(?=\\W+))|(\\s+[a-zA-Z0-9_-]+\\d+)/ # Removes digits, etc.
/[^(\w|\d|\'|\"|\.|\!|\?|;|,|\\|\/|\-|:|\&|@)]+/ # Doesn't work.
/(\\s+\\w{1,2}(?=\\W+))|(\\s+[a-zA-Z0-9_-]+\\d+)/ # Removes too much.
/[^\p{L}\p{N}-]+/u                       # It removes only special characters.
/(^[\D]+\s|\s[\D]+\s|\s[\D]+$|^[\D]+$)+/ # Removes words.
/ ?\b[^ ]*[0-9][^ ]*\b/i                 # Almost, but removes digits, price, phone.
/\s+[\w-]*\d[\w-]*|[\w-]*\d[\w-]*\s*/    # Almost, but removes digits, price, phone.
/\b\w*\d\w*\b-?/                         # Almost, but removes digits, price, phone.
/[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*/       # Almost, but removes too much.

I found these across SO (most of them are too specific) and on other sites. They are supposed to remove words with digits, but none of them does what I need.

How can I write a simple regular expression that removes these words without touching anything else?

Sample text:

To remove: 1st H20; 2nd O2; 3rd NUMB3RS;

To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD

Expected output:

To remove: ; ; ; To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, Phone: +44 (20) 123 ext:124, 4.4-BSD

Upvotes: 0

Views: 287

Answers (2)

Regular Jo

Reputation: 5510

How about replacing \b(?=[a-z]+\d|[a-z]*\d+[a-z]+)\w*\b\s* with nothing?

Demo: https://regex101.com/r/jA2fW3/1

Pattern code:

$pattern = '/\b(?=[a-z]+\d|[a-z]*\d+[a-z]+)\w*\b\s*/i';
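For anyone who wants to try it locally, here is a minimal sketch that applies the pattern above to the sample text from the question. The helper name `remove_alnum_words` is made up for illustration, and the script assumes the file is saved as UTF-8.

```php
<?php
// Hypothetical helper name; the pattern itself is the one given above.
function remove_alnum_words(string $text): string
{
    // \b        start at a word boundary
    // (?=...)   look ahead for letters followed by a digit,
    //           or (optional letters) digits followed by letters
    // \w*\b\s*  then consume the whole word plus any trailing whitespace
    $pattern = '/\b(?=[a-z]+\d|[a-z]*\d+[a-z]+)\w*\b\s*/i';

    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace($pattern, '', $text) ?? $text;
}

$text = 'To remove: 1st H20; 2nd O2; 3rd NUMB3RS; '
      . 'To leave: Digits: -2 0 5.5 10, Prices: $100 or £65, '
      . 'Phone: +44 (20) 123 ext:124, 4.4-BSD';

echo remove_alnum_words($text), PHP_EOL;
```

Because the lookahead only succeeds when letters and digits sit in the same run of word characters, tokens such as 5.5, $100 or ext:124 should be left alone, and the trailing `\s*` is what keeps "1st H20;" from leaving a doubled space behind.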

To match alphanumeric words containing foreign/accented letters, use the following pattern (the u modifier is needed so that PCRE reads the pattern and the subject as UTF-8):

$pattern = '/\b(?=[\pL]+\d|[\pL]*\d+[\pL]+)[\pL\w]*\b\s*/iu';

Demo: https://regex101.com/r/jA2fW3/3

Upvotes: 4

hwnd

Reputation: 70732

You can modify your regular expression as follows for the desired output.

$text = preg_replace('/\b(?:[a-z]+\d+[a-z]*|\d+[a-z]+)\b/i', '', $text);

To match any kind of letter from any language, use the Unicode property \p{L}:

$text = preg_replace('/\b(?:\pL+\d+\pL*|\d+\pL+)\b/u', '', $text);
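As a quick sanity check, the sketch below runs the Unicode pattern over the two word lists from the question. The helper name `strip_alnum_words_unicode` and the space-collapsing step are my own assumptions, not part of the pattern; the second `preg_replace` is only there because replacing a word with an empty string leaves its surrounding spaces in place.

```php
<?php
// Hypothetical helper name; the first pattern is the Unicode one above.
function strip_alnum_words_unicode(string $text): string
{
    // letters+digits(+letters) or digits+letters; /u switches PCRE to UTF-8 mode
    $stripped = preg_replace('/\b(?:\pL+\d+\pL*|\d+\pL+)\b/u', '', $text);

    // Assumed extra step, not part of the answer: squeeze doubled spaces.
    return (string) preg_replace('/ {2,}/', ' ', $stripped ?? $text);
}

// Word lists copied from the question.
$remove = ['1st', 'H20', '2nd', 'O2', '3rd', 'NUMB3RS', 'Rüthen1', 'Wrocław2'];
$keep   = ['0', '5.5', '10', '$100', '£65', '+44', '(20)', '123', 'ext:124', '4.4-BSD'];

foreach (array_merge($remove, $keep) as $word) {
    // Words from $remove should come back empty, words from $keep unchanged.
    echo $word, ' => [', strip_alnum_words_unicode($word), ']', PHP_EOL;
}
```

Note that the pattern only covers words shaped like letters-digits-letters or digits-letters; something like a1b2 would not match, which is fine for the examples in the question.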

Upvotes: 3
