Reputation: 313
I am trying to remove all special characters and numbers in Python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether or not they are attached to words. How can I do it in such a way that numbers attached to words are not removed?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid dns lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
Upvotes: 4
Views: 3933
Reputation: 626896
To match alphanumeric strings or letter-only words, you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters

NOTE: It is equivalent to the \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, which matches any 1+ alphanumeric character chunks with at least 1 letter in them.
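As a quick check, here is a minimal sketch that runs both patterns on the sample string from the question. The variable names are only illustrative.

```python
import re

# Sample input taken from the question.
text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# Alternation from this answer: a letter/digit mix, or letters only.
long_pattern = re.compile(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+')
# Shorter pattern from PJProudhon's answer.
short_pattern = re.compile(r'\d*[^\W\d_][^\W_]*')

tokens = long_pattern.findall(text.lower())
print(tokens)
# ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']

# Both patterns return the same tokens on this input.
print(tokens == short_pattern.findall(text.lower()))  # True
```

The digits-only chunk 64664646 is skipped because every alternative requires at least one letter.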
Upvotes: 2
Reputation: 825
You could try \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct. It matches any character which is not a non-word character, that is, any alphanumeric character or _ (\W is the negation of \w, which matches any alphanumeric character plus _; a common ASCII equivalent is [a-zA-Z0-9_]).
It proves useful here for composing classes:
[^\W_] matches any character which is not non-[alphanumeric or _] and is not _, that is, any letter or digit.
[^\W\d_] matches any character which is not non-[alphanumeric or _], is not a digit (\d), and is not _, that is, any letter.
Some further reading here.
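To see the three classes side by side, here is a small sketch on a made-up sample string (a letter, a digit, an underscore, an accented letter, and a hyphen). The sample and the variable names are my own, for illustration only.

```python
import re

# Hypothetical sample: "a", "1", "_", "é" (written as an escape), "-".
sample = "a1_\u00e9-"

word_chars = re.findall(r'[^\W]', sample)        # same as \w
alnum_chars = re.findall(r'[^\W_]', sample)      # letters and digits
letter_chars = re.findall(r'[^\W\d_]', sample)   # letters only

print(word_chars)    # ['a', '1', '_', 'é']
print(alnum_chars)   # ['a', '1', 'é']
print(letter_chars)  # ['a', 'é']
```

In Python 3, str patterns are Unicode-aware by default, which is why the accented letter is kept.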
Edit:
When _ is also considered a word delimiter, just drop the word boundaries (\b treats _ as a word character, so there is no boundary next to it) and use \d*[^\W\d_][^\W_]*.
The default greediness of the star operator will ensure all relevant characters are actually matched.
Demo.
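A short sketch of the difference between the two variants, using a made-up input where an underscore joins two words. The input string is an assumption for illustration.

```python
import re

# Hypothetical input: "win32" and "backdoor" are joined by an underscore.
text = "win32_backdoor 64664646 32win h0lla"

with_boundaries = re.findall(r'\b\d*[^\W\d_][^\W_]*\b', text)
without_boundaries = re.findall(r'\d*[^\W\d_][^\W_]*', text)

print(with_boundaries)     # ['32win', 'h0lla']
print(without_boundaries)  # ['win32', 'backdoor', '32win', 'h0lla']
```

With \b, nothing is matched inside win32_backdoor, because there is no word boundary on either side of the underscore. Without \b, both halves are returned.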
Upvotes: 2
Reputation: 381
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by flipping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.
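For reference, a minimal sketch of what the pattern returns as written, run on the question's string with a hypothetical "01ex" token appended. It uses finditer because the pattern has two capturing groups, so findall would return tuples instead of plain strings.

```python
import re

# Question's sample string, plus a hypothetical "01ex" token.
text = "win32 backdoor guid:64664646 DNS-lookup h0lla 01ex"
pattern = re.compile(r'([A-Za-z]+(\d)*[A-Za-z]*)')

# group(0) is the whole match, regardless of capturing groups.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
# ['win32', 'backdoor', 'guid', 'DNS', 'lookup', 'h0lla', 'ex']

# findall yields one (group 1, group 2) tuple per match.
print(pattern.findall("win32"))  # [('win32', '2')]
```

As written, the pattern needs a leading letter, so "01ex" only yields "ex". It is also limited to ASCII letters.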
Upvotes: 0