c1377554

Reputation: 313

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in Python, except numbers that are directly attached to words.

I have succeeded in removing all special characters and numbers, whether or not they are attached to words. How can I change the pattern so that numbers attached to words are kept?

Here's what I did:

import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())

I get as output

win backdoor guid DNS lookup h lla

But I want to get:

win32 backdoor guid DNS lookup h0lla

demo: https://regex101.com/r/x4HrGo/1

Upvotes: 4

Views: 3933

Answers (3)

Wiktor Stribiżew

Reputation: 626896

To match either letter-only words or alphanumeric chunks that contain at least one letter, you may use the following pattern with the built-in re module:

import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())

See the regex demo.

Details

  • (?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed by a digit, or 1+ digits followed by a letter, and then 0+ letters/digits
  • | - or
  • [^\W\d_]+ - any 1+ Unicode letters

NOTE: It is equivalent to the \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, which matches any chunk of 1+ alphanumeric characters with at least 1 letter in it.
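As a rough check of that equivalence, the sketch below runs both patterns over the sample string from the question. The variable names are made up for illustration, and agreement on one input is only a sanity check, not a proof.

```python
import re

# Sample input from the question, lowercased as in the answer.
text = "win32 backdoor guid:64664646 DNS-lookup h0lla".lower()

# Pattern from this answer: letters+digit or digits+letter, then any
# alphanumerics; or a run of letters only.
long_pattern = r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+'
# Shorter pattern from PJProudhon's answer.
short_pattern = r'\d*[^\W\d_][^\W_]*'

long_matches = re.findall(long_pattern, text)
short_matches = re.findall(short_pattern, text)

print(long_matches)                   # ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']
print(short_matches == long_matches)  # True
```

The digits-only chunk 64664646 is skipped by both patterns because neither can match without at least one letter.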

Upvotes: 2

PJProudhon

Reputation: 825

You could try \b\d*[^\W\d_][^\W_]*\b

Decomposition:

\b       # word boundary
\d*      # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]*  # zero or more alphanumeric characters
\b       # word boundary

For beginners:

[^\W] is a typical double-negated construct: it matches any character that is not a non-word character. \W is the negation of \w, which matches any alphanumeric character plus _ (common ASCII equivalent: [a-zA-Z0-9_]).

It proves useful here for composing:

  • Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
  • Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _], is not a digit (\d), and is not _.
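To make the two classes concrete, here is a small sketch that applies each one character by character. The sample string is an arbitrary choice for illustration, and it assumes Python 3, where \w and \d are Unicode-aware for str patterns.

```python
import re

# Arbitrary sample: letters, digits, an underscore, a hyphen, an accented
# letter and a space.
sample = "a1_-é9 Z"

# [^\W_]   -> word characters minus the underscore: letters and digits.
alphanumeric = re.findall(r'[^\W_]', sample)
# [^\W\d_] -> word characters minus digits and underscore: letters only.
alphabetic = re.findall(r'[^\W\d_]', sample)

print(alphanumeric)  # ['a', '1', 'é', '9', 'Z']
print(alphabetic)    # ['a', 'é', 'Z']
```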

Some further reading here.


Edit:
When _ is also considered a word delimiter, just drop the word boundaries, which do not trigger next to that character because _ counts as a word character, and use \d*[^\W\d_][^\W_]*.
The default greediness of the star operator ensures that all relevant characters are actually matched.

Demo.
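The effect of the word boundaries around underscores can be seen in a short sketch. The input string here is hypothetical, chosen so that some chunks are glued together by _.

```python
import re

# Hypothetical input: two chunks joined by underscores, one plain word.
text = "foo_bar 42_x win32"

# With \b: no boundary exists between a letter/digit and "_", so chunks
# touching an underscore cannot satisfy both boundaries.
with_boundaries = re.findall(r'\b\d*[^\W\d_][^\W_]*\b', text)
# Without \b: "_" simply acts as a separator.
without_boundaries = re.findall(r'\d*[^\W\d_][^\W_]*', text)

print(with_boundaries)     # ['win32']
print(without_boundaries)  # ['foo', 'bar', 'x', 'win32']
```

Which variant is preferable depends on whether identifiers such as foo_bar should be split or skipped.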

Upvotes: 2

tst

Reputation: 381

Try this RegEx instead:

([A-Za-z]+(\d)*[A-Za-z]*)

You can expand it from here, for example by flipping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.
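The sketch below tries the pattern and one possible reading of the "flipped" variant. It is an interpretation rather than the answerer's exact intent: the capture groups are dropped so that re.findall returns plain strings, and the test string is made up.

```python
import re

# Made-up input covering a trailing number, a leading number and an
# embedded digit.
text = "win32 01ex h0lla"

# Pattern from the answer, without capture groups.
original = r'[A-Za-z]+\d*[A-Za-z]*'
# Assumed reading of "flipping the * and +": swap the quantifiers on the
# first and last letter sets.
flipped = r'[A-Za-z]*\d*[A-Za-z]+'

original_matches = re.findall(original, text)
flipped_matches = re.findall(flipped, text)

print(original_matches)  # ['win32', 'ex', 'h0lla']
print(flipped_matches)   # ['win', '01ex', 'h0lla']
```

Under this reading each variant handles one of the two cases but not both, so covering "win32" and "01ex" together would likely need the two forms combined. Note also that [A-Za-z] is ASCII-only, unlike the [^\W\d_] class used in the other answers.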

Upvotes: 0
