Steffi Keran Rani J

Reputation: 4093

Error while Extracting Link from webpage using Python 3

Consider the following HTML:

<div class="more reviewdata">

<a onclick="bindreviewcontent('1660651',this,false,'I found this review of Star Health Insurance pretty useful',925075287,'.jpg','I found this review of Star Health Insurance pretty useful %23WriteShareWin','http://www.mouthshut.com/review/Star-Health-Insurance-review-toqnmqrlrrm','Star Health Insurance',' 2/5');" style="cursor:pointer">Read More</a>

</div>

From markup like the above, I want to extract only the http link, like this:

http://www.mouthshut.com/review/Star-Health-Insurance-review-toqnmqrlrrm

To achieve this, I wrote some code using BeautifulSoup and a regular expression in Python. The code is as follows:

import urllib.request
import re

from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://www.mouthshut.com/product-reviews/Star-Health-Insurance-reviews-925075287').read()

soup = BeautifulSoup(page, "html.parser")

required = soup.find_all("div", {"class": "more reviewdata"})

for link in re.findall('http://www.mouthshut.com/review/Star-Health-Insurance-review-[a-z]*', required):
   print(link)

On execution, the program raised the following error:

Traceback (most recent call last):
  File "E:/beautifulSoup20April2.py", line 11, in <module>
    for link in re.findall('http://www.mouthshut.com/review/Star-Health-Insurance-review-[a-z]*', required):
  File "C:\Program Files (x86)\Python35-32\lib\re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

Can someone suggest how to extract only the URL without this error?

Upvotes: 1

Views: 52

Answers (1)

Pedro Lobito

Reputation: 98861

First, you need to loop over required: find_all returns a list-like ResultSet of <class 'bs4.element.Tag'> objects, and re.findall only accepts a string or bytes-like object (this is what Python was complaining about). Second, you need to extract the HTML text from each bs4 element before applying the regex, which can be done with prettify().

Here's a working version:

import urllib.request
import re
from bs4 import BeautifulSoup

page = urllib.request.urlopen('http://www.mouthshut.com/product-reviews/Star-Health-Insurance-reviews-925075287').read()
soup = BeautifulSoup(page, "html.parser")

# find_all returns a ResultSet of Tag objects, so handle each div in turn
required = soup.find_all("div", {"class": "more reviewdata"})
for div in required:
    # prettify() turns the Tag into an HTML string that re.findall can search
    for link in re.findall(r'http://www\.mouthshut\.com/review/Star-Health-Insurance-review-[a-z]*', div.prettify()):
        print(link)

Output:

http://www.mouthshut.com/review/Star-Health-Insurance-review-ommmnmpmqtm
http://www.mouthshut.com/review/Star-Health-Insurance-review-rmqulrolqtm
http://www.mouthshut.com/review/Star-Health-Insurance-review-ooqrupoootm
http://www.mouthshut.com/review/Star-Health-Insurance-review-rlrnnuslotm
http://www.mouthshut.com/review/Star-Health-Insurance-review-umqsquttntm
...
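
To see the cause of the error without hitting the network, here is a minimal offline sketch that uses only the standard library. It is an illustration, not the code from the answer: the names PATTERN, snippet and fragments are made up for this example, a plain Python list stands in for the ResultSet that find_all returns, and the regex is a slightly more general assumption that matches any review slug rather than only the Star Health Insurance ones.

```python
import re

# Hypothetical, more general pattern: any review URL on the site
PATTERN = re.compile(r"http://www\.mouthshut\.com/review/[\w-]+")

# The HTML fragment from the question, kept as a plain string
snippet = """
<div class="more reviewdata">
<a onclick="bindreviewcontent('1660651',this,false,'I found this review of Star Health Insurance pretty useful',925075287,'.jpg','I found this review of Star Health Insurance pretty useful %23WriteShareWin','http://www.mouthshut.com/review/Star-Health-Insurance-review-toqnmqrlrrm','Star Health Insurance',' 2/5');" style="cursor:pointer">Read More</a>
</div>
"""

# A list of fragments stands in for the list-like ResultSet from find_all
fragments = [snippet]

# Passing the whole collection reproduces the TypeError from the question
try:
    re.findall(PATTERN, fragments)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# Searching each string individually works
links = [url for fragment in fragments for url in PATTERN.findall(fragment)]

print(error_message)
for url in links:
    print(url)
```

With real BeautifulSoup Tag objects, str(div) should also give you a searchable string if you do not need the re-indented output that prettify() produces.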

Upvotes: 1
