Reputation: 33
I have a large text and the aim is to select all 10-character strings for which the first character is a letter and the last character is a digit.
I am a Python rookie, and so far I have only managed to find all 10-character strings:
ten_char = re.findall(r"\D(\w{10})\D", pdfdoc)
My question is how to add the other conditions: besides being 10 characters long, the string must start with a letter and end with a digit.
Suggestions appreciated!
Upvotes: 3
Views: 442
Reputation: 33
Thank you very much for a great discussion and interesting suggestions. This is my very first post on Stack Overflow, but wow... what a community you are!
In fact, using:
r'\b([a-zA-Z]\S{8}\d)'
solved my problem very nicely. Really appreciated all your comments.
Upvotes: 0
Reputation: 103864
If I understand the question correctly, use:
r'\b([a-zA-Z]\S{8}\d)\b'
Python demo:
>>> import re
>>> txt="""\
... Should match:
... a123456789 aA34567s89 zzzzzzzer9
...
... Not match:
... 1123456789 aA34567s8a zzzzzzer9 zzzxzzzze99"""
>>> re.findall(r'\b([a-zA-Z]\S{8}\d)\b', txt)
['a123456789', 'aA34567s89', 'zzzzzzzer9']
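To illustrate why the trailing \b matters, here is a minimal sketch comparing the pattern with and without it. The sample text is made up for this comparison; the last token has 11 characters, so it should not count as a 10-character string.

```python
import re

# Hypothetical sample text: the second token is 11 characters long.
txt = "a123456789 zzzxzzzze99"

# With the trailing \b, the match must end at a word boundary.
with_boundary = re.findall(r'\b([a-zA-Z]\S{8}\d)\b', txt)

# Without it, the first 10 characters of a longer token also match.
without_boundary = re.findall(r'\b([a-zA-Z]\S{8}\d)', txt)

print(with_boundary)     # ['a123456789']
print(without_boundary)  # ['a123456789', 'zzzxzzzze9']
```

Whether the second behavior is acceptable depends on whether longer tokens can occur in your text.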
Upvotes: 1
Reputation: 9381
([a-z].{8}[0-9])
This matches 1 alphabetical character (upper or lower case when the case-insensitive flag is set), then any 8 characters, and finally 1 digit.
JS Demo
var re = /([a-z].{8}[0-9])/gi;
var str = 'Aasdf23423423423423423b423423423423423';
var m;
while ((m = re.exec(str)) !== null) {
    if (m.index === re.lastIndex) {
        re.lastIndex++;
    }
    console.log(m[0]);
}
https://regex101.com/r/gI8jZ4/1
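One caveat, shown here as a rough Python sketch since the question is about Python: the dot also matches spaces, and the pattern has no word boundaries, so it can match across several words. The sample string and the variable names are invented for illustration, and the stricter variant is only one possible way to tighten it.

```python
import re

# Hypothetical sample: a letter, 8 arbitrary characters (including spaces), then a digit.
sample = "b cd efg 1"

# The dot matches spaces, so the whole 10-character run is returned.
loose = re.findall(r'([a-z].{8}[0-9])', sample, re.IGNORECASE)

# Using \S and word boundaries restricts the match to a single token.
strict = re.findall(r'\b([a-z]\S{8}[0-9])\b', sample, re.IGNORECASE)

print(loose)   # ['b cd efg 1']
print(strict)  # []
```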
Upvotes: 2
Reputation: 6348
I wouldn't use regex for this. Regular string manipulation is clearer in my opinion (though I haven't tested the following code).
def get_useful_words(filename):
    with open(filename, 'r') as file:
        for line in file:
            for word in line.split():
                if len(word) == 10 and word[0].isalpha() and word[-1].isdigit():
                    yield word

for useful_word in get_useful_words('tmp.txt'):
    print(useful_word)
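If the text is already in memory as a string (as with pdfdoc in the question), the same idea can be sketched without a file. The function name get_useful_words_from_text and the sample contents of pdfdoc are assumptions made up for this example.

```python
def get_useful_words_from_text(text):
    """Yield whitespace-separated tokens that are 10 characters long,
    start with a letter, and end with a digit."""
    for word in text.split():
        if len(word) == 10 and word[0].isalpha() and word[-1].isdigit():
            yield word

# Hypothetical sample standing in for the real document text.
pdfdoc = "a123456789 aA34567s8a zzzzzzer9 zzzzzzzer9"

result = list(get_useful_words_from_text(pdfdoc))
print(result)  # ['a123456789', 'zzzzzzzer9']
```

Note that split() only breaks on whitespace, so a token with trailing punctuation such as "a123456789," counts as 11 characters and is skipped, unlike the regex versions with \b. Also, isalpha() and isdigit() accept non-ASCII letters and digits, which may or may not be what you want.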
Upvotes: 0