Reputation: 7145
I have the source HTML here: http://pastebin.com/rxK0mnVj. I want to check that an img tag in the source has a data-blzsrc attribute and that its src does not contain a data URI, and then return True or False.
For instance,
<img src="" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>
should return False, as the data-blzsrc attribute is present but the src attribute contains data:
But this,
<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>
should return True, as it contains the data-blzsrc attribute and the src does not contain data:
How can I achieve this in BeautifulSoup?
Upvotes: 0
Views: 2639
Reputation: 474161
If you want to find all img tags and test them, use find_all() and check the attributes, for example:
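from bs4 import BeautifulSoup

# Name the parser explicitly and close the file when done
with open('index.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

def check_img(img):
    # True when data-blzsrc is present and src is not a data: URI
    return 'data-blzsrc' in img.attrs and not img.get('src', '').startswith('data:')

for img in soup.find_all('img'):
    print(img, check_img(img))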
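To try the check without an index.html on disk, here is a minimal sketch that runs it on inline markup. The three img tags, the example.com URLs and the tiny data URI are made up for illustration, and the choice of 'html.parser' is an assumption rather than something the question specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the first src is a made-up data: URI,
# the second is a regular URL, the third has no data-blzsrc at all.
html = """
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=" data-blzsrc="http://example.com/a.webp" alt="lazy">
<img src="http://example.com/b.jpg" data-blzsrc="http://example.com/b.webp" alt="loaded">
<img src="http://example.com/c.jpg" alt="plain">
"""

def check_img(img):
    """Return True if data-blzsrc is present and src is not a data: URI."""
    has_blzsrc = img.has_attr('data-blzsrc')
    src = img.get('src', '')
    return has_blzsrc and not src.startswith('data:')

soup = BeautifulSoup(html, 'html.parser')
results = [check_img(img) for img in soup.find_all('img')]
print(results)  # [False, True, False]
```

Testing with startswith('data:') instead of a plain substring check avoids rejecting ordinary URLs that merely contain the word "data" somewhere in the path.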
If you want to select only the images that fit your criteria, you can pass an attrs argument to find_all(), providing a dictionary. Set data-blzsrc to True to enforce its existence, and use a function to check that the value of src is not a data: URI:
for img in soup.find_all('img', attrs={'data-blzsrc': True, 'src': lambda x: x and not x.startswith('data:')}):
    print(img)
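The same filter can probably also be written as a CSS selector. This is only a sketch, assuming a BeautifulSoup version (4.7 or later) in which select() is backed by the soupsieve package; the sample markup and example.com URLs are invented. Note one difference: the selector also matches an img that has data-blzsrc but no src at all, which the lambda version skips.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question's page.
html = """
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=" data-blzsrc="http://example.com/a.webp">
<img src="http://example.com/b.jpg" data-blzsrc="http://example.com/b.webp">
<img src="http://example.com/c.jpg">
"""

soup = BeautifulSoup(html, 'html.parser')

# img elements that carry data-blzsrc and whose src does not start with "data:"
matches = soup.select('img[data-blzsrc]:not([src^="data:"])')
srcs = [img['src'] for img in matches]
print(srcs)  # ['http://example.com/b.jpg']
```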
Upvotes: 1
Reputation: 5193
Find all the images, then check that the desired attribute exists and inspect the content of the src attribute. Have a look at this script:
from bs4 import BeautifulSoup
html = """
<img src="" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>
<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>
"""
soup = BeautifulSoup(html, 'html.parser')
for img in soup.find_all('img'):
    # these are your desired conditions
    if img.has_attr('data-blzsrc') and not img.attrs.get('src', '').startswith('data:'):
        print(img)
It prints the desired img node:
<img alt="StrawberryNET" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" height="60" src="http://images.akam.net/img1.jpg" width="324"/>
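If a single True or False for the whole page is wanted, which is one possible reading of the question, the per-image test can be folded into one function. The sketch below makes that assumption explicit: page_passes is a hypothetical helper name, and the rule "at least one img with data-blzsrc, and none of them with a data: src" is a guess at the intended semantics.

```python
from bs4 import BeautifulSoup

def page_passes(html):
    """Hypothetical helper: True if the page has at least one img with
    data-blzsrc and none of those images uses a data: URI as its src."""
    soup = BeautifulSoup(html, 'html.parser')
    lazy_imgs = soup.find_all('img', attrs={'data-blzsrc': True})
    if not lazy_imgs:
        # Assumption: a page without any data-blzsrc image counts as a failure.
        return False
    return all(not img.get('src', '').startswith('data:') for img in lazy_imgs)

# Made-up inputs for illustration
good = '<img src="http://example.com/b.jpg" data-blzsrc="http://example.com/b.webp">'
bad = '<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=" data-blzsrc="http://example.com/a.webp">'

print(page_passes(good))        # True
print(page_passes(bad))         # False
print(page_passes(good + bad))  # False
```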
Upvotes: 0