Reputation: 5863
I have been recently using beautiful soup 4 and I have been struggling to understand some basics of this (I was quite ok with bs3.x for some reason). So, for example, lets start off by doing something simple like:
data=soup.find_all('h2')
which yields me something like:
<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>
which is fine. But when I want to regex the above string, using something along the lines off (assuming the above is stored in "temp"):
t=str(re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""").search(str(temp)).group(1))
I get:
AttributeError: 'NoneType' object has no attribute 'group'
which I find strange - because, when I do on the python interpretter, something like:
k=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>"""
and then use the above regex, everything works fine. I am wondering why the "tags" type generated by bs4 seems non regex'able. Now I feel maybe I am doing something stupid or maybe something has changed between bs3.x and bs4 which I am not aware of. Any help on this would be appreciated. Thanks.
Upvotes: 1
Views: 1240
Reputation: 101959
You should try to see the repr
of the string:
>>> a=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>"""
>>> print repr(a)
'<h2><a href=\\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\\">more-accurate-data</a></h2>'
And the regex works with this representation:
>>> regex = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")
>>> regex.match(a)
<_sre.SRE_Match object at 0x20fbf30>
The problem is that the result from beautiful soup is different, because you did not print its repr. When dealing with regexes it's a good idea to check the repr
of the strings involved to avoid things like this.
Upvotes: 2