Reputation: 159
This is, I fear, frighteningly simple, but I can't make it work (and I can't find the answer through a search). I am scraping a website for all words in italics (the ones I want are in groups of two words--they are binomial scientific names), but I don't want any numbers returned.
The regex I used, <i>(.+?)</i>, worked great, but it also pulled the numbers. I thought using \D would work, but it didn't. What am I doing wrong?
Upvotes: 2
Views: 87
Reputation: 70732
Yes, I basically want to strip integers from any string inside the tags.
Using Python's re.findall and looping through your matches, replacing the digit characters, should work for you.
import re
pattern = re.compile(r'(?<=<i>).*?(?=</i>)')
for name in pattern.findall(htmltext):  # htmltext: the scraped page source
    # Remove every digit from the italicized text
    print(re.sub(r'[0-9]', '', name))
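As a self-contained sketch of the loop above: the sample string and the strip_digits helper name are made up for illustration and are not from the question. The extra .strip() call is also an addition, to drop the space left behind when a trailing number is removed.

```python
import re

# Hypothetical sample input; the real htmltext would be the scraped page source.
htmltext = "<p><i>Homo sapiens 1758</i> and <i>Pan troglodytes2</i></p>"

# Text between <i> and </i>, matched lazily so each element is captured separately.
ITALIC = re.compile(r"(?<=<i>).*?(?=</i>)")

def strip_digits(html):
    """Return the text of each <i>...</i> element with all digits removed."""
    return [re.sub(r"[0-9]", "", name).strip() for name in ITALIC.findall(html)]

names = strip_digits(htmltext)
print(names)
```

Collecting the cleaned names in a list, rather than printing inside the loop, makes it easier to filter them further (for example, keeping only two-word names).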
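To find only the matches that contain no digits at all:
# '<' is excluded as well, so a match cannot run past a closing </i> into the next element
matches = re.findall(r'(?<=<i>)[^0-9<]*(?=</i>)', htmltext)
print(matches)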
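A small self-contained sketch of this filtering approach, on a made-up sample string. Note one assumption beyond the pattern as first posted: the character class here also excludes '<', so that a single match cannot stretch across several <i> elements when no digit happens to sit between them.

```python
import re

# Hypothetical sample input; the real htmltext would be the scraped page source.
htmltext = "<i>Homo sapiens</i>, page 12, <i>Figure 3</i>, <i>Pan troglodytes</i>"

# Keep only italic elements whose text contains no digit and no nested tag.
clean = re.findall(r"(?<=<i>)[^0-9<]*(?=</i>)", htmltext)
print(clean)
```

Unlike the substitution loop, this drops an element such as "Figure 3" entirely instead of returning it with the digit removed, so which variant fits depends on whether numbered entries should be kept.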
Upvotes: 2