Why isn't this regexp working

Question

I have the source code of a webpage, formatted like this:


Turkish


The.Mist[2007]DvDrip[Eng]-aXXo


Vietnamese


The.Mist.2007.720p.Bluray.x264.YIFY

As you can see, the language names sit in spans with a class of either "l r positive-icon" or "l r neutral-icon". I want only the languages, that is, the text inside every span that has a class, whatever that class is. I use this regexp, but it gives me an empty list:

allLanguages = re.findall('\s(.*)\s', allLanguagesTags)

allLanguagesTags contains the source code shown above. Can anybody give me a hint?

Martijn Pieters · Accepted Answer

Don't use regular expressions. Use an actual HTML parser. I recommend you use BeautifulSoup instead:

from bs4 import BeautifulSoup

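# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(yourhtml, 'html.parser')
# class_=True matches only the spans that carry a class attribute.
languages = [s.get_text().strip() for s in soup.find_all('span', class_=True)]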

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... 
... Turkish
... 
... 
... The.Mist[2007]DvDrip[Eng]-aXXo
... 
... 
... Vietnamese
... 
... 
... The.Mist.2007.720p.Bluray.x264.YIFY 
... 
... ''')
>>> [s.get_text().strip() for s in soup.find_all('span', class_=True)]
[u'Turkish', u'Vietnamese']
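
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only an illustration, not a drop-in replacement: the class name ClassedSpanText is made up for this example, and the sample markup is a guess at the page structure based on the class names mentioned in the question, so adjust it to the real source.

```python
from html.parser import HTMLParser


class ClassedSpanText(HTMLParser):
    """Collect the text of every <span> that has a non-empty class attribute."""

    def __init__(self):
        super().__init__()
        self.languages = []
        self._depth = 0    # > 0 while inside a span that has a class
        self._chunks = []  # text pieces seen inside the current span

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        if self._depth:
            # A span nested inside a tracked span: keep the nesting balanced.
            self._depth += 1
        elif dict(attrs).get('class'):
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'span' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.languages.append(''.join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical markup, assumed for illustration only.
sample_html = '''
<a href="#">
  <span class="l r positive-icon">
    Turkish
  </span>
  <span>
    The.Mist[2007]DvDrip[Eng]-aXXo
  </span>
</a>
<a href="#">
  <span class="l r neutral-icon">
    Vietnamese
  </span>
  <span>
    The.Mist.2007.720p.Bluray.x264.YIFY
  </span>
</a>
'''

parser = ClassedSpanText()
parser.feed(sample_html)
parser.close()
print(parser.languages)
```

Spans without a class (the release names here) are skipped, which mirrors what class_=True does in the BeautifulSoup version. For anything beyond a quick script, BeautifulSoup is still the more forgiving choice with messy real-world HTML.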
