Reputation: 15
I would like to extract every link target (the href values) from the text below using re.findall
and store the results in a list.
my_text = <td class="1stclass"> <div class="2ndclass"> <div class="2ndclass__img"><a href="link_extension_1.php"><div class="3rdclass"><img alt="hello" border="0" class="image" height="42" src="https://yoyo.jpg"/></div></a></div> <div class="2ndclass__content"><p><a href="link_extension_1.php"></a> </p> </div> <div class="2ndclass__compare"><label for="comparer2" style="font-size:11px;"><input class="js__media__compare__input" id="comparer2" name="comparer" type="checkbox" value="89453"/> Comparer</label></div> </div></td>
<td class="1stclass"> <div class="2ndclass"> <div class="2ndclass__img"><a href="link_extension_2.php"><div class="3rdclass"><img alt="hello" border="0" class="image" height="42" src="https://yoyo.jpg"/></div></a></div> <div class="2ndclass__content"><p><a href="link_extension_2.php"></a> </p> </div> <div class="2ndclass__compare"><label for="comparer2" style="font-size:11px;"><input class="js__media__compare__input" id="comparer2" name="comparer" type="checkbox" value="89453"/> Comparer</label></div> </div></td>
<td class="1stclass"> <div class="2ndclass"> <div class="2ndclass__img"><a href="link_extension_3.php"><div class="3rdclass"><img alt="hello" border="0" class="image" height="42" src="https://yoyo.jpg"/></div></a></div> <div class="2ndclass__content"><p><a href="link_extension_3.php"></a> </p> </div> <div class="2ndclass__compare"><label for="comparer2" style="font-size:11px;"><input class="js__media__compare__input" id="comparer2" name="comparer" type="checkbox" value="89453"/> Comparer</label></div> </div></td>
I'm trying to get this result:
["link_extension_1.php","link_extension_2.php","link_extension_3.php"]
I tried this:
re.findall(r'\<div class="2ndclass__img"><a href="(.*?)\"><div', my_text)
but got this error:
SyntaxError: unexpected EOF while parsing
Thanks, Max
Upvotes: 0
Views: 101
Reputation: 12015
Your regex works fine for me:
>>> re.findall(r'\<div class="2ndclass__img"><a href="(.*?)\"><div', my_text)
['link_extension_1.php', 'link_extension_2.php', 'link_extension_3.php']
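Since the pattern itself matches, the SyntaxError probably comes from somewhere else in the script. This is an assumption, because the question does not show how my_text was defined, but "unexpected EOF while parsing" usually points to an unclosed bracket or quote, or to HTML that was pasted without being wrapped in a string. A minimal sketch with a shortened stand-in for the HTML in a triple-quoted string (the backslashes before < and " are not needed, so they are dropped here):

```python
import re

# Shortened stand-in for the question's HTML (assumption: the real text is
# longer). Triple quotes let the markup keep its own double quotes.
my_text = """
<td class="1stclass"> <div class="2ndclass__img"><a href="link_extension_1.php"><div class="3rdclass"></div></a></div> <p><a href="link_extension_1.php"></a></p></td>
<td class="1stclass"> <div class="2ndclass__img"><a href="link_extension_2.php"><div class="3rdclass"></div></a></div> <p><a href="link_extension_2.php"></a></p></td>
<td class="1stclass"> <div class="2ndclass__img"><a href="link_extension_3.php"><div class="3rdclass"></div></a></div> <p><a href="link_extension_3.php"></a></p></td>
"""

# Same pattern as in the question, minus the redundant escapes.
pattern = r'<div class="2ndclass__img"><a href="(.*?)"><div'

# Only the first <a> of each cell follows the 2ndclass__img div, so each
# link is captured once.
links = re.findall(pattern, my_text)
print(links)
```

If this runs but your script still fails, check the lines around the re.findall call for a missing closing parenthesis or quote.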
However, it is better to avoid parsing HTML with regex and to use a tool designed for the job, such as BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(my_text, "html.parser")
>>> [div.find('a').get('href') for div in soup.find_all('div', {'class': "2ndclass__img"})]
['link_extension_1.php', 'link_extension_2.php', 'link_extension_3.php']
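If installing bs4 is not an option, the standard library's html.parser module can do a similar job. This is a swapped-in alternative, not the BeautifulSoup approach above, and only a rough sketch: the class name ImgLinkCollector and the shortened sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgLinkCollector(HTMLParser):
    """Collect the href of the first <a> that follows each
    <div class="2ndclass__img"> (hypothetical helper, not part of bs4)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._want_link = False  # True right after a 2ndclass__img div opens

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "2ndclass__img" in classes:
            self._want_link = True
        elif tag == "a" and self._want_link:
            self.links.append(attrs.get("href"))
            self._want_link = False


# Shortened stand-in for the question's HTML (assumption).
sample = "".join(
    '<td class="1stclass"><div class="2ndclass__img">'
    f'<a href="link_extension_{n}.php"><div class="3rdclass">'
    '<img alt="hello" src="https://yoyo.jpg"/></div></a></div>'
    f'<p><a href="link_extension_{n}.php"></a></p></td>'
    for n in (1, 2, 3)
)

collector = ImgLinkCollector()
collector.feed(sample)
links = collector.links
print(links)
```

The flag-based approach assumes an <a> always appears directly inside each 2ndclass__img div, as in the sample; BeautifulSoup handles irregular markup more robustly.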
Upvotes: 1