Luna

Reputation: 45

Isolate SRC attribute from soup return in python

I am using Python 3 with BeautifulSoup to get a certain div from a webpage. My end goal is to get the URL in the img tag's src attribute from within this div so I can pass the image to pytesseract and get the text off it.

The img doesn't have any classes or unique identifiers so I am not sure how to use BeautifulSoup to get just this image every time. There are several other images and their order changes from day to day. So instead, I just got the entire div that surrounds the image. The div information doesn't change and is unique, so my code looks like this:

weather_today = soup.find("div", {"id": "weather_today_content"})

My script currently returns the following:

<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>

Now I just need to figure out how to pull only the src value into a string, so I can download the image and use OCR via pytesseract to pull further information.

I am unfamiliar with regex but have been told this is the best method. Any assistance would be greatly appreciated. Thank you.

Upvotes: 1

Views: 89

Answers (2)

Dan-Dev

Reputation: 9430

Find the 'img' element inside the 'div' element you found, then read its 'src' attribute.

from bs4 import BeautifulSoup

html = """
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
weather_today = soup.find("div", {"id": "weather_today_content"})
print(weather_today.find('img')['src'])

Outputs:

/database/img/weather_today.jpg?ver=2018-08-01
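
Note that the src value here is a path relative to the site root, so it probably needs to be joined with the page's address before the image can be downloaded. A minimal sketch of that step, assuming the standard library's urljoin and a made-up page URL (PAGE_URL below is a placeholder, not the real site):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address; replace it with the URL the HTML was fetched from.
PAGE_URL = "https://example.com/weather/"

html = """
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
weather_today = soup.find("div", {"id": "weather_today_content"})
src = weather_today.find("img")["src"]

# urljoin resolves the root-relative path against the page URL.
image_url = urljoin(PAGE_URL, src)
print(image_url)
```

With the placeholder above, image_url points at the same host as PAGE_URL with the path taken from src; the query string (?ver=...) is kept as-is.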

Upvotes: 1

Andrej Kesely

Reputation: 195438

You can use a CSS selector, which is built into BeautifulSoup (methods select() and select_one()):

data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""


from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])

Prints:

/database/img/weather_today.jpg?ver=2018-08-01

The selector div#weather_today_content img means: select the <div> with id=weather_today_content, and within this <div> select an <img>.
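
Since select_one() returns None when nothing matches, indexing the result directly raises a TypeError if the div or image is missing on a given day. A small defensive sketch, assuming a hypothetical helper named extract_img_src (not part of BeautifulSoup) and the built-in 'html.parser' so that lxml is not required:

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_img_src(html: str) -> Optional[str]:
    """Hypothetical helper: return the img src inside the weather div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.select_one("div#weather_today_content img")
    if img is None:
        # Either the div or the image inside it was not found.
        return None
    # .get() returns None instead of raising KeyError when src is absent.
    return img.get("src")


data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""

print(extract_img_src(data))
print(extract_img_src('<div id="something_else"></div>'))
```

The first call returns the src string from the sample markup; the second returns None, which the caller can check before attempting a download.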

Upvotes: 1
