Fitz

Reputation: 17

Extracting src attribute

What I want to do:

This HTML code:

<img class="poster lazyload lazyloaded"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
     data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
     alt="Hitman"
     src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
     srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
     data-loaded="true">

I want to extract the value of the "data-src" or "src" attribute (or of any attribute that contains the URL of the image).

What I Tried:

Posters = soup.find("img")["src"]
print(Posters)

But this does not restrict the search to the posters, so the links I get are unrelated to them. Output:

https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG
https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG

By posters I mean the film posters (see this URL: https://www.themoviedb.org/search?&query=Hitman).

Summary

I want to extract the value of an attribute from the elements that have the class "lazyloaded".

I hope everything is clear. Thanks.


Edit:

Explanation: where was the problem?

For everyone reading: Laurent's answer is the solution. The problem was the parsed HTML.

In my browser, the img tag holding the attribute I was trying to scrape has the class "poster lazyload lazyloaded".

But if we print website.content:

   <img class="poster lazyload" 
        data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"                                                                          
        data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 2x"
        alt="The Hitman&#x27;s Bodyguard Collection">

the markup is quite different: the class is only "poster lazyload", and there is no src attribute, only data-src.
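
For anyone hitting the same mismatch, here is a minimal sketch of one way to handle it, assuming the "lazyloaded" class and the src attribute are only added in the browser after the page loads. It filters on the "poster" class, which is present in the raw response, and reads data-src first, falling back to src. The sample markup is modeled on the response above; the logo URL and the helper name poster_urls are made up for illustration.

```python
from bs4 import BeautifulSoup

# Markup modeled on the raw server response: no "lazyloaded" class, no src.
# The logo URL is a placeholder, not the real one.
raw_html = """
<img class="logo" src="https://example.com/logo.svg">
<img class="poster lazyload"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
     alt="The Hitman&#x27;s Bodyguard Collection">
"""

soup = BeautifulSoup(raw_html, "html.parser")


def poster_urls(soup):
    """Return the image URL of every img tag that carries the "poster" class."""
    urls = []
    for img in soup.find_all("img", class_="poster"):
        # Prefer data-src (present in the raw HTML), fall back to src.
        url = img.get("data-src") or img.get("src")
        if url:
            urls.append(url)
    return urls


print(poster_urls(soup))
```

Using img.get(...) instead of img[...] avoids a KeyError when one of the two attributes is missing.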

Upvotes: 0

Views: 90

Answers (1)

Laurent LAPORTE

Reputation: 23012

You can try to filter by class:

posters = soup.find_all("img", {"class": "lazyloaded"})

for poster in posters:
    print(poster["src"])

See the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
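
As a small sketch of what that documentation section describes: searching for a single class also matches tags that carry several classes, and the same filter can be written in a few equivalent ways. The inline HTML and file names below are invented for the example.

```python
from bs4 import BeautifulSoup

# Invented two-image document: one logo, one lazy-loaded poster.
html = (
    '<img class="logo" src="logo.png">'
    '<img class="poster lazyload lazyloaded" src="poster.jpg">'
)
soup = BeautifulSoup(html, "html.parser")

# Three ways to select img tags by CSS class.
by_dict = soup.find_all("img", {"class": "lazyloaded"})
by_keyword = soup.find_all("img", class_="lazyloaded")
by_selector = soup.select("img.poster.lazyloaded")

for tag in by_dict:
    print(tag["src"])
```

The CSS selector form is convenient when a tag must have several classes at once.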

edit: more explanation

Say you have the following file demo.html:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Title</title>
</head>
<body>
<img class="logo" src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg">
<img class="poster lazyload lazyloaded"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
     data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
     alt="Hitman"
     src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
     srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
     data-loaded="true">
</body>
</html>

You can parse the "poster" images like this:

from bs4 import BeautifulSoup

# Read the demo file and parse it with the built-in HTML parser.
with open("demo.html", encoding="utf8") as fd:
    soup = BeautifulSoup(fd.read(), features="html.parser")

# Keep only the img tags whose class list includes "lazyloaded".
posters = soup.find_all("img", {"class": "lazyloaded"})

for poster in posters:
    print(poster["src"])

You get:

https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg

Upvotes: 2
