adrCoder

Reputation: 3275

Python - regex to keep only words with textual characters

I want a regex in my Python program that keeps only words made up entirely of alphabetic characters (i.e. no digits and no special characters such as dots, commas, :, ! etc.).

I am using this code to get the words from a text file:

find_words = re.compile(r'\w+').findall

The problem with this regular expression is that for input like this:

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: [email protected]
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 U6u1HjX9A2VnveGmx3CbhhgTr7o+NJWodWNJQjg1aSLDkLnJwruLq9hBBcqxouFq
 NY7xtb92dCTfvEjdmkDrUw==

0001393311-11-000011.txt : 20110301
0001393311-11-000011.hdr.sgml : 20110301
20110301164350
ACCESSION NUMBER:       0001393311-11-000011
CONFORMED SUBMISSION TYPE:  10-K
PUBLIC DOCUMENT COUNT:      16
CONFORMED PERIOD OF REPORT: 20101231
FILED AS OF DATE:       20110301
DATE AS OF CHANGE:      20110301

FILER:

I get output like this:

begin
privacy
enhanced
message
proc
type
2001
mic
clear
originator
name
webmaster
www
sec
gov
originator
key
asymmetric
mfgwcgyevqgbaqicaf8dsgawrwjaw2snkk9avtbzyzmr6agjlwyk3xmzv3dtinen
twsm7vrzladbmyqaionwg5sdw3p6oam5d3tdezxmm7z1t
b
twidaqab
mic
info
rsa
md5
rsa
u6u1hjx9a2vnvegmx3cbhhgtr7o
njwodwnjqjg1asldklnjwrulq9hbbcqxoufq
ny7xtb92dctfvejdmkdruw
0001393311
11
000011
txt
20110301
0001393311
11
000011
hdr
sgml

which is not what I want because

it does not keep words that I want it to keep, such as "Accession", "Number" etc. It also keeps tokens like mfgwcgyevqgbaqicaf8dsgawrwjaw2snkk9avtbzyzmr6agjlwyk3xmzv3dtinen, which I don't want because of the digits in them, and purely numeric tokens like 0001393311, which I don't want either.

Any ideas on how to get only the words I want, i.e. words that contain only alphabetic characters?

Upvotes: 3

Views: 13233

Answers (4)

Avinash Raj

Reputation: 174696

Here you need negative lookbehind and lookahead assertions, so that a match is accepted only when whitespace (or the start or end of the string) surrounds it.

(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
  • (?<!\S)[A-Za-z]+(?!\S) matches a whole whitespace-delimited token that consists only of ASCII letters.

  • | OR

  • (?<!\S)[A-Za-z]+(?=:(?!\S)) matches one or more letters that are followed by a colon, where the colon itself is not followed by a non-space character. You could also use (?=:\s) instead of (?=:(?!\S)), but then a colon at the very end of the string would not count.
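To see what this pattern does with input like the question's, here is a minimal sketch. The `sample` string is a shortened excerpt of the question's text, and the names `WORD`, `sample` and `words` are chosen only for illustration.

```python
import re

# Pattern from the answer: a run of ASCII letters delimited by whitespace
# (or the string boundaries), optionally followed by a colon that ends the token.
WORD = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))')

# Shortened excerpt of the input shown in the question.
sample = """-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
ACCESSION NUMBER:       0001393311-11-000011
CONFORMED SUBMISSION TYPE:  10-K
FILER:
"""

words = WORD.findall(sample)
print(words)
# ['ACCESSION', 'NUMBER', 'CONFORMED', 'SUBMISSION', 'TYPE', 'FILER']
```

Note that hyphenated tokens such as PRIVACY-ENHANCED and Proc-Type are rejected as a whole, because the letters in them touch a non-space character.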


Upvotes: 3

Toto

Reputation: 91375

I'd use:

(?<=^|\P{L})\p{L}+(?=\P{L}|$)

or, to avoid a variable-length lookbehind:

(?<!\p{L})\p{L}+(?=\P{L}|$)

where:

\p{L} means any Unicode letter
\P{L} is the opposite of \p{L}, i.e. any character that is NOT a letter
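One caveat for Python: the built-in `re` module does not accept `\p{L}` (the third-party `regex` module does). A rough standard-library equivalent is the character class `[^\W\d_]`, which for `str` patterns matches a word character that is neither a digit nor an underscore. The sketch below is an approximation of the second pattern under that assumption, and the names `LETTER`, `WORD` and `found` are illustrative only.

```python
import re

# Stand-in for \p{L}: a word character that is not a digit or underscore.
LETTER = r'[^\W\d_]'

# Same shape as the second pattern: a maximal run of letters with no
# letter directly before or after it.
WORD = re.compile(rf'(?<!{LETTER}){LETTER}+(?!{LETTER})')

found = WORD.findall('Señor Müller: 10-K, f8d')
print(found)
# ['Señor', 'Müller', 'K', 'f', 'd']
```

As the output shows, this approach extracts every run of letters, including the runs inside mixed tokens such as f8d, so it does not by itself discard tokens that contain digits.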

Upvotes: 0

Wiktor Stribiżew

Reputation: 626690

If you need to extract words that are separated by non-word characters, you can use the regex \b[a-zA-Z]+\b (it yields Originator and Name from Originator-Name:).

If you want to limit to the entities that are most likely to be words, I'd suggest something like:

(?<![.-])\b([a-z]{2,}|[A-Z]{1}[a-z]+|[A-Z]{2,})\b(?!\.|@|\-)

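This regex narrows the matches: it accepts only all-lowercase, capitalized, or all-uppercase words of at least two letters, and it rejects a word that is preceded by . or - or followed by ., @ or -.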
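A small side-by-side sketch of the two patterns from this answer. The `sample` line is stitched together from fragments of the question's input, and the variable names are illustrative only.

```python
import re

# Simple variant: letter runs bounded by word boundaries.
SIMPLE = re.compile(r'\b[a-zA-Z]+\b')

# Stricter variant from the answer, copied verbatim.
STRICT = re.compile(
    r'(?<![.-])\b([a-z]{2,}|[A-Z]{1}[a-z]+|[A-Z]{2,})\b(?!\.|@|\-)'
)

sample = 'Originator-Name: webmaster@www.sec.gov FILED AS OF DATE: 20110301'

simple_hits = SIMPLE.findall(sample)
strict_hits = STRICT.findall(sample)  # findall returns the single group

print(simple_hits)
# ['Originator', 'Name', 'webmaster', 'www', 'sec', 'gov', 'FILED', 'AS', 'OF', 'DATE']
print(strict_hits)
# ['FILED', 'AS', 'OF', 'DATE']
```

The stricter pattern drops the pieces of the hyphenated header name and of the e-mail address, at the cost of also dropping hyphenated words you might have wanted.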

Upvotes: 0

Shan-x

Reputation: 1176

# Matches only if the whole string consists of one or more ASCII letters
re.match(r"^[A-Za-z]+$", string)
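This check works on one token at a time, so the text has to be split first. A minimal sketch, assuming whitespace-separated tokens; the helper name `alphabetic_tokens` is hypothetical, and `re.fullmatch` is used in place of the `^...$` anchors.

```python
import re

ONLY_LETTERS = re.compile(r'[A-Za-z]+')

def alphabetic_tokens(text):
    """Return the whitespace-separated tokens made up solely of ASCII letters."""
    return [tok for tok in text.split() if ONLY_LETTERS.fullmatch(tok)]

print(alphabetic_tokens('FILED AS OF DATE:       20110301 10-K abc'))
# ['FILED', 'AS', 'OF', 'abc']
```

A token such as DATE: is dropped because of its trailing colon, so any punctuation you want to tolerate would have to be stripped before the check.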

Upvotes: 3
