user12382982

BeautifulSoup: extract a value from an element without a class in Python

I want to extract data using BeautifulSoup in Python.

My document:

<div class="listing-item" data-id="309531" data-score="0">

  <div class="thumb">
    <a href="https://res.cloudinary.com/">

      <div style="background-image:url(https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2292,y_50/co_rgb:FFFFFF,l_text:oswald_100_bold_letter_spacing_5:01,y_-107/c_fit,w_200/abu-dhabi-plate_private-car_classic);"></div>
    </a>
  </div>
</div>

I want to get the background image URL from this element:

<div style="background-image:url(https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2292,y_50/co_rgb:FFFFFF,l_text:oswald_100_bold_letter_spacing_5:01,y_-107/c_fit,w_200/abu-dhabi-plate_private-car_classic);"></div>

My code:

from textwrap import shorten
from bs4 import BeautifulSoup
from urllib.parse import parse_qsl, urljoin, urlparse
import requests

url = 'https://uae.dubizzle.com/motors/number-plates/?page={}'

print('{:^50} {:^15} {:^25} '.format('Title', 'Price', 'Date'))

for page in range(0, 40):   # <--- Increase to number pages you want
    response = requests.get(url.format(page))
    soup = BeautifulSoup(response.text, 'lxml')

    for title, price, date, thumb  in zip(soup.select('.listing-item .title'),
                            soup.select('.listing-item .price'),
                            soup.select('.listing-item .date'),
                            soup.select('.listing-item .thumb')):

        print('{:50} {:<25} {:<15}'.format(shorten(title.get_text().strip(), 50), price.get_text().strip(), thumb.get_text().strip()))

How can I get the background image URL from the document?

Upvotes: 1

Views: 112

Answers (2)

KunduK

Reputation: 33384

Use find_next('div') on the thumb element to reach the inner div, then read its style attribute. A regular expression can then pull the image URL out of the style string.

Try the code below.

from textwrap import shorten
from bs4 import BeautifulSoup
import requests
import re

url = 'https://uae.dubizzle.com/motors/number-plates/?page={}'

print('{:^50} {:^15} {:^25} '.format('Title', 'Price', 'Date'))

for page in range(0, 40):   # <--- Increase to number pages you want
    response = requests.get(url.format(page))
    soup = BeautifulSoup(response.text, 'lxml')

    for title, price, date, thumb  in zip(soup.select('.listing-item .title'),
                            soup.select('.listing-item .price'),
                            soup.select('.listing-item .date'),
                            soup.select('.listing-item .thumb')):

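        # Pull the URL out of the inline style of the first <div> after .thumb
        style = thumb.find_next("div")['style']
        image_url = re.search(r"https?://[^\s);]+", style).group(0)
        print('{:50} {:<25} {:<15}'.format(shorten(title.get_text().strip(), 50), price.get_text().strip(), image_url))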

Here is some output on console:

G91911 - Excellent for PORSCHE                     AED 59,000                https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:88887,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:J,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
R 199                                              AED 49,000                https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2122,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:M,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
88887 J                                            AED 49,000                https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2212,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:S,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
M2122                                              AED 52,000                https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:22022,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:J,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
S 2212                                             AED 309,000               https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:5000,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:L,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
22022 J                                            AED 9,500                 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:5945,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:H,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_classic
5000 L                                             AED 2,800,000             https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:90,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:Z,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_classic
Dubai                                              AED 760,000               https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:10000,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:H,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_classic
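The regex step can also be tried on its own, without fetching the page. The following is only a sketch: the function name extract_background_url and the pattern are illustrative assumptions, not part of BeautifulSoup, and the pattern matches the url(...) wrapper (with optional quotes) rather than the https prefix.

```python
import re

# Hypothetical helper, not a BeautifulSoup API: matches url(...) in an
# inline style, with optional single or double quotes around the value.
URL_IN_STYLE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")


def extract_background_url(style):
    """Return the value inside url(...) of a style string, or None."""
    if not style:
        return None
    match = URL_IN_STYLE.search(style)
    return match.group(2) if match else None


# Style string copied from the question's sample document.
SAMPLE_STYLE = (
    "background-image:url(https://res.cloudinary.com/dubizzle-com/image/upload/"
    "co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2292,y_50/"
    "co_rgb:FFFFFF,l_text:oswald_100_bold_letter_spacing_5:01,y_-107/"
    "c_fit,w_200/abu-dhabi-plate_private-car_classic);"
)

if __name__ == "__main__":
    print(extract_background_url(SAMPLE_STYLE))
```

In the loop it could be called as extract_background_url(thumb.find_next("div").get("style")); returning None instead of raising lets you skip a listing that has no background image.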

Upvotes: 0

Maaz

Reputation: 2445

The URL is in the style attribute of the div nested inside each thumb element, so you can extract it with plain string splitting.

You can try this:

CODE:

from textwrap import shorten
from bs4 import BeautifulSoup
from urllib.parse import parse_qsl, urljoin, urlparse
import requests

url = 'https://uae.dubizzle.com/motors/number-plates/?page={}'

print('{:^50} {:^15} {:^25} '.format('Title', 'Price', 'Date'))

for page in range(0, 1):   # <--- Increase to number pages you want
    response = requests.get(url.format(page))
    soup = BeautifulSoup(response.text, 'lxml')

    for title, price, date, thumb in zip(soup.select('.listing-item .title'),
                                         soup.select('.listing-item .price'),
                                         soup.select('.listing-item .date'),
                                         soup.select('.listing-item .thumb')):
        # Take the text between 'url(' and the closing ');' of the inline style
        image_url = thumb.find('div').get('style').split('url(')[1].split(');')[0]
        print('{:50} {:<25} {:<15}'.format(shorten(title.get_text().strip(), 50), price.get_text().strip(), image_url))
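One caveat with the split('url(') chain: it raises IndexError when a style has no url(...) part, and thumb.find('div') returns None when there is no inner div. Below is a more defensive variant, given as a sketch; the helper name url_from_style is made up for illustration and uses only str.partition.

```python
def url_from_style(style):
    """Return the text between 'url(' and the next ')', or None if absent.

    Hypothetical helper for illustration; uses only built-in str methods.
    """
    if not style:
        return None
    _, opener, rest = style.partition('url(')
    if not opener:          # no 'url(' in the style string
        return None
    value, closer, _ = rest.partition(')')
    if not closer:          # 'url(' was never closed
        return None
    # Drop surrounding whitespace and optional quotes around the URL.
    return value.strip().strip('\'"')


if __name__ == "__main__":
    print(url_from_style("background-image:url(https://res.cloudinary.com/);"))
```

Inside the loop, something like inner = thumb.find('div') followed by url_from_style(inner.get('style') if inner else None) would give None for listings without an image instead of stopping the script.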

Upvotes: 1
