Reputation: 465
for imgsrc in Soup.findAll('img', {'class': 'sizedProdImage'}):
    if imgsrc:
        imgsrc = imgsrc
    else:
        imgsrc = "ERROR"
patImgSrc = re.compile('src="(.*)".*/>')
findPatImgSrc = re.findall(patImgSrc, imgsrc)
print findPatImgSrc
'''
<img height="72" name="proimg" id="image" class="sizedProdImage" src="http://imagelocation" />
This is what I am trying to extract from and I am getting:
findimgsrcPat = re.findall(imgsrcPat, imgsrc)
File "C:\Python27\lib\re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
'''
Upvotes: 20
Views: 37424
Reputation: 5152
In my example, htmlText contains the img tags directly, but the same approach works on HTML fetched from a URL.
from BeautifulSoup import BeautifulSoup as BSHTML

# Two well-formed img tags to extract from
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" />"""
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    # Index the tag like a dict to read one of its attributes
    print image['src']
Upvotes: 0
Reputation: 36767
You're passing a BeautifulSoup node to re.findall, which expects a string to search. Convert the node to a string first:
findPatImgSrc = re.findall(patImgSrc, str(imgsrc))
Better yet, use the tools BeautifulSoup provides:
[x['src'] for x in soup.findAll('img', {'class': 'sizedProdImage'})]
This gives you a list of the src attributes of all img tags with class 'sizedProdImage'.
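If BeautifulSoup is not available, the same attribute-based idea (read src from the parsed tag rather than regex-matching its markup) can be roughly sketched with Python 3's standard-library html.parser. This is only an illustration, not a replacement for the list comprehension above: ImgSrcCollector and html_text are made-up names, and the sample markup is modelled on the tag from the question.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collect src of <img> tags that carry a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes and attr_map.get("src"):
            self.sources.append(attr_map["src"])


# Sample markup: the tag from the question plus one tag that should be skipped.
html_text = (
    '<img height="72" name="proimg" id="image" class="sizedProdImage" '
    'src="http://imagelocation" />'
    '<img class="other" src="http://elsewhere" />'
)

collector = ImgSrcCollector("sizedProdImage")
collector.feed(html_text)
collector.close()
print(collector.sources)
```

As with the BeautifulSoup version, no regular expression is involved, so attribute order and spacing inside the tag do not matter.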
Upvotes: 31
Reputation: 30947
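You're compiling a pattern object and then passing it to re.findall. That is allowed (re.findall accepts either a pattern string or a compiled pattern as its first argument), but it is redundant:
patImgSrc = re.compile('src="(.*)".*/>')
findPatImgSrc = re.findall(patImgSrc, imgsrc)
Instead, you can call the .findall method of the patImgSrc object you just created. Either way, the text being searched must be a string, so convert the node with str() first:
patImgSrc = re.compile('src="(.*)".*/>')
findPatImgSrc = patImgSrc.findall(str(imgsrc))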
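A quick way to check how the two call styles behave, sketched in Python 3 with only the standard library. FakeNode is a hypothetical stand-in for a parsed tag (it is not a BeautifulSoup class); the only property assumed is that it is not a string but str() returns its markup.

```python
import re


class FakeNode(object):
    """Hypothetical stand-in for a parsed tag: not a str, but str() gives markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


node = FakeNode('<img class="sizedProdImage" src="http://imagelocation" />')
patImgSrc = re.compile('src="(.*)".*/>')  # pattern taken from the question

# Searching the non-string object fails with either call style.
errors = []
for call in (lambda: re.findall(patImgSrc, node),
             lambda: patImgSrc.findall(node)):
    try:
        call()
    except TypeError as exc:
        errors.append(type(exc).__name__)

# After str(), the module-level function and the method give the same result.
via_module = re.findall(patImgSrc, str(node))
via_method = patImgSrc.findall(str(node))
print(errors, via_module, via_method)
```

Under these assumptions, the choice between re.findall(patImgSrc, ...) and patImgSrc.findall(...) is a matter of style; what has to change is the type of the text argument.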
Upvotes: 0