PythonFisher

Reputation: 159

Omitting Numbers with regex

This is, I fear, frighteningly simple, but I can't make it work (and I can't find the answer through a search). I am scraping a website for all words in italics (the ones I want are in groups of two words--they are binomial scientific names), but I don't want any numbers returned.

The regex I used: <i>(.+?)</i>

worked great, but it also pulled in the numbers. I thought using \D would work, but it didn't. What am I doing wrong?

Upvotes: 2

Views: 87

Answers (2)

hwnd

Reputation: 70732

"Yes, I basically want to strip integers from any string inside the tags."

Looping over the matches from Python's re.findall and removing the digit characters from each one should work for you.

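import re

# The lookarounds match the text between <i> and </i> without the tags
pattern = re.compile(r'(?<=<i>).*?(?=</i>)')

for names in re.findall(pattern, htmltext):
    print(re.sub(r'[0-9]', '', names))  # remove every digit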
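
For a self-contained run of the loop above, here is a minimal sketch. The sample htmltext string and the strip_digits helper name are made up for illustration, and the .strip() call is an extra step added here to drop the whitespace left behind when a number is removed.

```python
import re

# Hypothetical sample input; the real htmltext would come from the scraped page.
htmltext = "<p><i>Salmo trutta 12</i> and <i>Esox lucius</i>, see <i>3</i></p>"

# Same pattern as above: the text between <i> and </i>, tags excluded.
pattern = re.compile(r'(?<=<i>).*?(?=</i>)')

def strip_digits(html):
    """Return the text of every <i>...</i> span with its digits removed."""
    cleaned = []
    for name in pattern.findall(html):
        name = re.sub(r'[0-9]', '', name).strip()
        if name:  # skip spans that held nothing but numbers
            cleaned.append(name)
    return cleaned

print(strip_digits(htmltext))
```

Collecting the results in a list instead of printing them inside the loop makes it easier to filter or deduplicate the names afterwards.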

To find the matches that do not contain numbers:

# Excluding '<' as well keeps a match from running across several tags
matches = re.findall(r'(?<=<i>)[^0-9<]*(?=</i>)', htmltext)
print(matches)
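
To see how this differs from stripping the digits, the sketch below keeps only the italic spans that contain no digits at all. The sample string and the names_without_digits helper are assumptions for illustration. The character class here excludes `<` in addition to digits, since a class that excludes only digits can run from one tag pair into the next when no digit sits between them, and it uses `+` so that empty tags are skipped.

```python
import re

# Hypothetical sample input with one span that contains a number.
htmltext = "<i>Salmo trutta</i> text <i>Esox lucius 7</i> more <i>Perca fluviatilis</i>"

def names_without_digits(html):
    """Return the text of each <i>...</i> span that has no digits in it."""
    # [^0-9<]+ stops at any digit or at the start of any tag, so a span
    # containing a digit never reaches its closing </i> and is not matched.
    return re.findall(r'(?<=<i>)[^0-9<]+(?=</i>)', html)

print(names_without_digits(htmltext))
```

Note that this drops a span entirely if it contains a digit, whereas the re.sub loop keeps the span and removes only the digits.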

Upvotes: 2

Kerem Zaman

Reputation: 539

I think this should work; you can try it: [^0-9]+

Upvotes: -1
