Reputation: 3275
I want a regex in my Python program that keeps only words made up entirely of alphabetical characters (i.e. no digits and no special characters such as dots, commas, :, ! etc.)
I am using this code to get the words from a text file:
find_words = re.compile(r'\w+').findall
The problem with this regular expression is that for input like this:
-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: [email protected]
Originator-Key-Asymmetric:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
U6u1HjX9A2VnveGmx3CbhhgTr7o+NJWodWNJQjg1aSLDkLnJwruLq9hBBcqxouFq
NY7xtb92dCTfvEjdmkDrUw==
0001393311-11-000011.txt : 20110301
0001393311-11-000011.hdr.sgml : 20110301
20110301164350
ACCESSION NUMBER: 0001393311-11-000011
CONFORMED SUBMISSION TYPE: 10-K
PUBLIC DOCUMENT COUNT: 16
CONFORMED PERIOD OF REPORT: 20101231
FILED AS OF DATE: 20110301
DATE AS OF CHANGE: 20110301
FILER:
I get output like this:
begin
privacy
enhanced
message
proc
type
2001
mic
clear
originator
name
webmaster
www
sec
gov
originator
key
asymmetric
mfgwcgyevqgbaqicaf8dsgawrwjaw2snkk9avtbzyzmr6agjlwyk3xmzv3dtinen
twsm7vrzladbmyqaionwg5sdw3p6oam5d3tdezxmm7z1t
b
twidaqab
mic
info
rsa
md5
rsa
u6u1hjx9a2vnvegmx3cbhhgtr7o
njwodwnjqjg1asldklnjwrulq9hbbcqxoufq
ny7xtb92dctfvejdmkdruw
0001393311
11
000011
txt
20110301
0001393311
11
000011
hdr
sgml
which is not what I want, because it does not keep words that I want to keep, such as "Accession" and "Number". It also keeps strings like mfgwcgyevqgbaqicaf8dsgawrwjaw2snkk9avtbzyzmr6agjlwyk3xmzv3dtinen, which I don't want because of the digits in them, and purely numeric tokens like 0001393311, which I don't want either.
Any ideas on how to get only the words that I want (i.e. words that contain only alphabetical characters)?
Upvotes: 3
Views: 13233
Reputation: 174696
Here you need negative lookaround assertions: a look-behind before the word and a look-ahead after it.
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
(?<!\S)[A-Za-z]+(?!\S)
matches a whole whitespace-delimited token that consists only of ASCII letters.
|
OR
(?<!\S)[A-Za-z]+(?=:(?!\S))
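matches one or more letters at the start of a token, followed by a colon that is in turn not followed by a non-space character. The colon is only asserted, not included in the match. You could also use (?=:\s) instead of (?=:(?!\S)), but that requires an actual whitespace character after the colon, so it will not match when the colon is the very last character of the input.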
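As a quick check, here is a minimal sketch that applies the pattern above to a few lines from the question's input. The names `WORD_RE`, `sample` and `words` are only illustrative, and the lowercasing is an assumption based on the lowercase output shown in the question.

```python
import re

# Pattern copied from the answer above: a letters-only token delimited by
# whitespace, optionally followed by a trailing colon (which is not captured).
WORD_RE = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))')

# A few lines taken from the question's input.
sample = (
    "Proc-Type: 2001,MIC-CLEAR\n"
    "ACCESSION NUMBER: 0001393311-11-000011\n"
    "CONFORMED SUBMISSION TYPE: 10-K\n"
    "FILER:\n"
)

# Lowercasing is an assumption; drop it to keep the original case.
words = [w.lower() for w in WORD_RE.findall(sample)]
print(words)
# ['accession', 'number', 'conformed', 'submission', 'type', 'filer']
```

Note that hyphenated tokens such as Proc-Type are skipped entirely, because each half touches a non-space character on one side.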
Upvotes: 3
Reputation: 91375
I'd use:
(?<=^|\P{L})\p{L}+(?=\P{L}|$)
or, to avoid variable lookbehind:
(?<!\p{L})\p{L}+(?=\P{L}|$)
where:
\p{L} means any letter (unicode)
\P{L} is the opposite of \p{L}, i.e. anything that is NOT a letter
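One caveat for Python users: the built-in `re` module does not support `\p{L}`; the third-party `regex` module does. The sketch below is a rough standard-library approximation, not the pattern from this answer. It uses the character class `[^\W\d_]` (a word character that is neither a digit nor an underscore) to stand in for "any letter". The names `LETTERS_RE`, `sample` and `runs` are illustrative.

```python
import re

# Approximation of \p{L}+ with the standard library: word characters
# minus digits and underscore. In Python 3, \w is Unicode-aware for str
# patterns, so accented letters are included.
LETTERS_RE = re.compile(r'[^\W\d_]+')

sample = "Originator-Name: café MFgw8DSg 10-K"

# findall is greedy, so each result is a maximal run of letters.
runs = LETTERS_RE.findall(sample)
print(runs)
# ['Originator', 'Name', 'café', 'MFgw', 'DSg', 'K']
```

As far as I can tell, the lookarounds in the patterns above only ensure that each match is a maximal run of letters, which a greedy `findall` already gives. Either way, the letter fragments of mixed tokens such as MFgw8DSg are still returned, so this may not be what the question asks for.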
Upvotes: 0
Reputation: 626690
If you need to extract words separated by non-word characters, you can use the regex \b[a-zA-Z]+\b (it outputs Originator and Name from Originator-Name:).
If you want to limit to the entities that are most likely to be words, I'd suggest something like:
(?<![.-])\b([a-z]{2,}|[A-Z]{1}[a-z]+|[A-Z]{2,})\b(?!\.|@|\-)
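This regex limits the matches to all-lowercase words of two or more letters, capitalized words, and all-uppercase words of two or more letters, and only when they are not preceded by a dot or hyphen and not followed by a dot, @ or hyphen.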
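To compare the two patterns from this answer, here is a small sketch run against fragments of the question's input. The names `SIMPLE_RE`, `STRICT_RE` and `sample` are illustrative only.

```python
import re

# Both patterns are copied from the answer above.
SIMPLE_RE = re.compile(r'\b[a-zA-Z]+\b')
STRICT_RE = re.compile(
    r'(?<![.-])\b([a-z]{2,}|[A-Z]{1}[a-z]+|[A-Z]{2,})\b(?!\.|@|\-)'
)

# \b requires a word boundary, so letters glued to digits are rejected.
print(SIMPLE_RE.findall("Originator-Name: MFgw8DSg"))
# ['Originator', 'Name']

sample = (
    "Originator-Name:\n"
    "ACCESSION NUMBER: 0001393311-11-000011\n"
    "RSA-MD5,RSA,\n"
)

# The strict pattern has one capturing group, so findall returns its content.
print(STRICT_RE.findall(sample))
# ['ACCESSION', 'NUMBER', 'RSA']
```

The strict pattern drops both halves of Originator-Name because each one touches a hyphen, which may or may not be desirable depending on the input.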
Upvotes: 0