Reputation: 333
I am trying to extract runs of letters from a string that are neither directly preceded nor directly followed by a number.
Here's an example string:
string = "ts0060_LOD-70234_lr2_billboards_rgba_over_s3d_lf_v5_2Kdciufa_lnh"
This is what I have so far:
re.findall(r"[a-z]+", string.lower())
which gives this result:
['ts', 'lod', 'lr', 'billboards', 'rgba', 'over', 's', 'd', 'lf', 'v', 'kdciufa', 'lnh']
... but the result I am looking for is something more like this:
['lod', 'billboards', 'rgba', 'over', 'lf', 'lnh']
Is there a way of achieving this using regular expressions?
Many thanks.
Upvotes: 4
Views: 4505
Reputation: 76194
An alternative to using findall is to split the string into individual words and then filter out any words containing non-alphabetic characters.
import re
string = "ts0060_LOD-70234_lr2_billboards_rgba_over_s3d_lf_v5_2Kdciufa_lnh"
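# split on non-alphanumeric characters
words = re.split("[^a-z0-9]", string.lower())
print("words:", words)
# keep only the words made up entirely of letters
# (list() is needed on Python 3, where filter returns an iterator)
filtered_words = list(filter(str.isalpha, words))
print("filtered words:", filtered_words)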
Result:
words: ['ts0060', 'lod', '70234', 'lr2', 'billboards', 'rgba', 'over', 's3d', 'lf', 'v5', '2kdciufa', 'lnh']
filtered words: ['lod', 'billboards', 'rgba', 'over', 'lf', 'lnh']
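One detail worth noting about this approach: consecutive delimiters leave empty strings in the split result, and those are dropped by the same filter because "".isalpha() is False. Here is a small Python 3 sketch of that behaviour. The helper name alpha_only_tokens and the sample input are made up for illustration.

```python
import re


def alpha_only_tokens(text):
    """Split on anything that is not a-z or 0-9, keep purely alphabetic tokens.

    Hypothetical helper, not part of the original answer.
    """
    tokens = re.split("[^a-z0-9]", text.lower())
    # str.isalpha() is False for "" and for any token containing a digit,
    # so empty strings and mixed tokens such as "a1" are both discarded.
    return [token for token in tokens if token.isalpha()]


sample = "a1__bc--d"  # made-up input with doubled delimiters
print(re.split("[^a-z0-9]", sample.lower()))  # raw split, including empty strings
print(alpha_only_tokens(sample))              # only the alphabetic tokens
```

The raw split here is ['a1', '', 'bc', '', 'd'], and the helper returns ['bc', 'd'].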
Upvotes: 2
Reputation: 1122142
Use negative look-arounds:
re.findall(r"(?<![\da-z])[a-z]+(?![\da-z])", string.lower())
This matches lower-case letters that are not immediately preceded or followed by more letters or digits.
Demo:
>>> import re
>>> string = "ts0060_LOD-70234_lr2_billboards_rgba_over_s3d_lf_v5_2Kdciufa_lnh"
>>> re.findall(r"(?<![\da-z])[a-z]+(?![\da-z])", string.lower())
['lod', 'billboards', 'rgba', 'over', 'lf', 'lnh']
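If the original casing should be kept, one possible variation is to drop the .lower() call and pass re.IGNORECASE instead, so that the character classes in both the match and the look-arounds also cover upper-case letters. This is a sketch under that assumption, not something the question asked for. The variable names are illustrative.

```python
import re

string = "ts0060_LOD-70234_lr2_billboards_rgba_over_s3d_lf_v5_2Kdciufa_lnh"

# Same look-arounds as in the answer, but matched case-insensitively so the
# input does not have to be lower-cased first.
pattern = re.compile(r"(?<![\da-z])[a-z]+(?![\da-z])", re.IGNORECASE)

original_case = pattern.findall(string)
print(original_case)
```

With the example string this should give ['LOD', 'billboards', 'rgba', 'over', 'lf', 'lnh']. Note that the letters have to stay in the look-arounds: with only \d there, the regex engine can backtrack to a shorter run and return fragments such as 't' from 'ts0060'.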
Upvotes: 8