station

Reputation: 7145

How to find a specific tag using BeautifulSoup

I have the source HTML here: http://pastebin.com/rxK0mnVj. I want to check whether an img tag has the data-blzsrc attribute and whether its src is not a data URI, then return True or False.

For instance,

<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>

should return False, because the data-blzsrc attribute is present but the src attribute is a data: URI,

but this,

<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>

should return True, because it has the data-blzsrc attribute and the src is not a data: URI.

How can I achieve this in BeautifulSoup?

Upvotes: 0

Views: 2639

Answers (2)

alecxe

Reputation: 474161

If you want to find all img tags and test them, use find_all() and check the attributes. For example:

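from bs4 import BeautifulSoup

# Name the parser explicitly so results do not depend on what is installed
with open('index.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

def check_img(img):
    # True if data-blzsrc is present and src is not a data: URI
    return 'data-blzsrc' in img.attrs and not img.get('src', '').startswith('data:')

for img in soup.find_all('img'):
    print(img, check_img(img))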

If you want to keep only the images that fit your criteria, you can pass an attrs argument to find_all() as a dictionary. Set data-blzsrc to True to enforce its existence, and use a function to check that the value of src is not a data URI:

for img in soup.find_all('img', attrs={'data-blzsrc': True, 'src': lambda x: x and not x.startswith('data:')}):
    print(img)
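For a quick sanity check, here is a minimal self-contained sketch that runs a similar filter against tags modelled on the ones in the question. The inline html string, the shortened example.com URLs, the third tag, and the helper name is_real_src are assumptions made for illustration. It assumes Python 3 and the built-in html.parser, and tests for the data: prefix rather than the substring.

```python
from bs4 import BeautifulSoup

# Sample tags modelled on the question (data-blzsrc URLs shortened for illustration).
html = """
<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://example.com/a.webp" alt="placeholder">
<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://example.com/b.webp" alt="real">
<img src="http://images.akam.net/img2.jpg" alt="no-blzsrc">
"""

soup = BeautifulSoup(html, 'html.parser')


def is_real_src(value):
    # Hypothetical helper: bs4 passes None when the attribute is missing.
    return value is not None and not value.startswith('data:')


# data-blzsrc must exist, and src must pass the helper above.
matches = soup.find_all('img', attrs={'data-blzsrc': True, 'src': is_real_src})
matched_alts = [img['alt'] for img in matches]
print(matched_alts)
```

Only the second tag should survive the filter: the first has a data: URI in src, and the third has no data-blzsrc attribute.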

Upvotes: 1

xecgr

Reputation: 5193

Find all the images, then check that the desired attribute exists and inspect the content of the src attribute. Have a look at this script:

from bs4 import BeautifulSoup
html = """
<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>
<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a>
"""

soup = BeautifulSoup(html, 'html.parser')
for img in soup.find_all('img'):
    # keep images that have data-blzsrc and whose src is not a data: URI
    if img.has_attr('data-blzsrc') and not img.get('src', '').startswith('data:'):
        print(img)

It prints the desired img node:

<img alt="StrawberryNET" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" height="60" src="http://images.akam.net/img1.jpg" width="324"/>
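If you need a single True or False for a whole page rather than a printed list, the same condition can be wrapped in a function. This is only a sketch: the names img_ok and all_images_ok are made up here, and it assumes that "pass" means every img tag qualifies and that a page with no images should fail. Swap all() for any() if one qualifying image is enough.

```python
from bs4 import BeautifulSoup


def img_ok(img):
    # True when data-blzsrc is present and src is not a data: URI
    return img.has_attr('data-blzsrc') and not img.get('src', '').startswith('data:')


def all_images_ok(html):
    # Hypothetical helper: parse the document and test every <img> tag.
    imgs = BeautifulSoup(html, 'html.parser').find_all('img')
    # A page without images counts as a failure here; that choice is an assumption.
    return bool(imgs) and all(img_ok(img) for img in imgs)


good = '<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://example.com/a.webp">'
bad = '<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://example.com/a.webp">'

print(all_images_ok(good))        # True
print(all_images_ok(bad))         # False
print(all_images_ok(good + bad))  # False
```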

Upvotes: 0
