Reputation: 17
This HTML code:
<img class="poster lazyload lazyloaded"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
alt="Hitman"
src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
data-loaded="true">
I want to extract the value of the "data-src" or "src" attribute (or of any attribute that contains the URL of the image).
Posters = soup.find("img")["src"]
print(Posters)
But this is not restricted to the poster images, so the links I get are unrelated to the posters. Output:
https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG
https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG
By posters I mean the film posters shown on this page: https://www.themoviedb.org/search?&query=Hitman
I want to extract an attribute value from the img tags that have the class "lazyloaded".
I hope everything is clear. Thanks.
Edit:
For everyone reading: Laurent's answer is the solution. The problem was the HTML I was parsing.
In my browser, the img tag I was trying to scrape has the class "poster lazyload lazyloaded", but printing website.content gives:
<img class="poster lazyload"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 2x"
alt="The Hitman's Bodyguard Collection">
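The served HTML differs from what the browser shows: the class is only "poster lazyload", and there are no src, srcset, or data-loaded attributes, only data-src and data-srcset.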
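Given that difference, a minimal sketch of what seems to work against the raw HTML is to match on the "poster" class, which is present in both versions, and read data-src first with src as a fallback. The raw_html string below is a shortened copy of the snippet above, and the logo URL in it is a made-up placeholder; the names raw_html and poster_urls are only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample of the markup as served (before any JavaScript runs).
# The logo URL is a placeholder, not a real TMDB asset.
raw_html = """
<img class="logo" src="https://example.com/logo.svg">
<img class="poster lazyload"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
     alt="The Hitman's Bodyguard Collection">
"""

soup = BeautifulSoup(raw_html, "html.parser")

poster_urls = []
for img in soup.find_all("img", class_="poster"):
    # Tag.get returns None instead of raising KeyError when an attribute is absent,
    # so this works whether or not src has been filled in.
    url = img.get("data-src") or img.get("src")
    if url:
        poster_urls.append(url)

print(poster_urls)
```

With the real page, raw_html would be replaced by website.content; the logo is skipped because it does not carry the "poster" class.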
Upvotes: 0
Views: 90
Reputation: 23012
You can try to filter by class:
posters = soup.find_all("img", {"class": "lazyloaded"})
for poster in posters:
    print(poster["src"])
See the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
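As a rough illustration of what that documentation section describes, the sketch below shows three equivalent ways to match a single CSS class on a tag that carries several classes. The HTML string and its example.com URLs are placeholders invented for this example, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Placeholder markup: one logo and one poster with a multi-valued class attribute.
html = """
<img class="logo" src="https://example.com/logo.png">
<img class="poster lazyload lazyloaded" src="https://example.com/poster.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

# Searching for "lazyloaded" matches any tag whose class list contains it.
by_dict = soup.find_all("img", {"class": "lazyloaded"})
by_keyword = soup.find_all("img", class_="lazyloaded")

# A CSS selector can require several classes at once.
by_selector = soup.select("img.poster.lazyloaded")

srcs = [img["src"] for img in by_selector]
print(srcs)
```

All three calls return the same single img tag here; the selector form is handy when one class alone is not specific enough.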
edit: more explanation
Say you have the following file demo.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<img class="logo" src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg">
<img class="poster lazyload lazyloaded"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
alt="Hitman"
src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
data-loaded="true">
</body>
</html>
You can parse the "poster" images like this:
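import io

from bs4 import BeautifulSoup

# Read the demo file and parse it with the built-in HTML parser
with io.open("demo.html", encoding="utf8") as fd:
    soup = BeautifulSoup(fd.read(), features="html.parser")

# Keep only the img tags whose class list contains "lazyloaded"
posters = soup.find_all("img", {"class": "lazyloaded"})
for poster in posters:
    print(poster["src"])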
You get:
https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg
Upvotes: 2