I want to extract data using BeautifulSoup in Python.
My document:
<div class="listing-item" data-id="309531" data-score="0">
    <div class="thumb">
        <a href="https://res.cloudinary.com/">
            <div style="background-image:url(https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2292,y_50/co_rgb:FFFFFF,l_text:oswald_100_bold_letter_spacing_5:01,y_-107/c_fit,w_200/abu-dhabi-plate_private-car_classic);"></div>
        </a>
    </div>
</div>
Here I want to get the background image URL from:
<div style="background-image:url(https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2292,y_50/co_rgb:FFFFFF,l_text:oswald_100_bold_letter_spacing_5:01,y_-107/c_fit,w_200/abu-dhabi-plate_private-car_classic);"></div>
My code:
from textwrap import shorten
from bs4 import BeautifulSoup
from urllib.parse import parse_qsl, urljoin, urlparse
import requests

url = 'https://uae.dubizzle.com/motors/number-plates/?page={}'
print('{:^50} {:^15} {:^25} '.format('Title', 'Price', 'Date'))
for page in range(0, 40):  # <--- increase to the number of pages you want
    response = requests.get(url.format(page))
    soup = BeautifulSoup(response.text, 'lxml')
    for title, price, date, thumb in zip(soup.select('.listing-item .title'),
                                         soup.select('.listing-item .price'),
                                         soup.select('.listing-item .date'),
                                         soup.select('.listing-item .thumb')):
        print('{:50} {:<25} {:<15}'.format(shorten(title.get_text().strip(), 50),
                                           price.get_text().strip(),
                                           thumb.get_text().strip()))
How can I get the background image URL from the document?
Upvotes: 1
Views: 112
Reputation: 33384
You need to use find_next('div') to get the div element, then read its style attribute and use a regular expression to extract the image URL.
Try the code below.
from textwrap import shorten
from bs4 import BeautifulSoup
import requests
import re

url = 'https://uae.dubizzle.com/motors/number-plates/?page={}'
print('{:^50} {:^15} {:^25} '.format('Title', 'Price', 'Date'))
for page in range(0, 40):  # <--- increase to the number of pages you want
    response = requests.get(url.format(page))
    soup = BeautifulSoup(response.text, 'lxml')
    for title, price, date, thumb in zip(soup.select('.listing-item .title'),
                                         soup.select('.listing-item .price'),
                                         soup.select('.listing-item .date'),
                                         soup.select('.listing-item .thumb')):
        # find_next('div') reaches the inner div that carries the inline style;
        # the regex pulls the URL out of background-image:url(...)
        print('{:50} {:<25} {:<15}'.format(shorten(title.get_text().strip(), 50),
                                           price.get_text().strip(),
                                           re.search(r"https?://[^\s]+[^);]", thumb.find_next("div")['style']).group(0)))
Here is some output on the console:
G91911 - Excellent for PORSCHE AED 59,000 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:88887,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:J,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
R 199 AED 49,000 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2122,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:M,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
88887 J AED 49,000 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2212,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:S,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
M2122 AED 52,000 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:22022,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:J,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
S 2212 AED 309,000 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:5000,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:L,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_new
22022 J AED 9,500 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:5945,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:H,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_classic
5000 L AED 2,800,000 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:90,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:Z,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_classic
Dubai AED 760,000 https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:10000,x_100,y_-50/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_4:H,x_-240,y_-50/c_fit,w_200/dubai-plate_private-car_classic
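For reference, the same find_next('div') plus regex extraction can be checked offline against the HTML snippet from the question. This is a minimal sketch with no network request; it only assumes the markup shown in the question.

import re
from bs4 import BeautifulSoup

# HTML snippet copied from the question
html = '''
<div class="listing-item" data-id="309531" data-score="0">
    <div class="thumb">
        <a href="https://res.cloudinary.com/">
            <div style="background-image:url(https://res.cloudinary.com/dubizzle-com/image/upload/co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2292,y_50/co_rgb:FFFFFF,l_text:oswald_100_bold_letter_spacing_5:01,y_-107/c_fit,w_200/abu-dhabi-plate_private-car_classic);"></div>
        </a>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
thumb = soup.select_one('.listing-item .thumb')

# find_next('div') returns the inner div carrying the inline style;
# the regex stops before the closing ');'
style = thumb.find_next('div')['style']
print(re.search(r'https?://[^\s)]+', style).group(0))

This prints the Cloudinary URL from the question's div.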
Upvotes: 0
Reputation: 2445
You can get the URL by searching inside your thumb value.
You can try this:
CODE:
from textwrap import shorten
from bs4 import BeautifulSoup
from urllib.parse import parse_qsl, urljoin, urlparse
import requests

url = 'https://uae.dubizzle.com/motors/number-plates/?page={}'
print('{:^50} {:^15} {:^25} '.format('Title', 'Price', 'Date'))
for page in range(0, 1):  # <--- increase to the number of pages you want
    response = requests.get(url.format(page))
    soup = BeautifulSoup(response.text, 'lxml')
    for title, price, date, thumb in zip(soup.select('.listing-item .title'),
                                         soup.select('.listing-item .price'),
                                         soup.select('.listing-item .date'),
                                         soup.select('.listing-item .thumb')):
        # The image URL sits between 'url(' and ');' in the inline style:
        # thumb.find('div').get('style').split('url(')[1].split(');')[0]
        print('{:50} {:<25} {:<15}'.format(shorten(title.get_text().strip(), 50),
                                           price.get_text().strip(),
                                           thumb.find('div').get('style').split('url(')[1].split(');')[0]))
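As a minimal offline check of this split-based extraction, the snippet below works on the style value from the question. It assumes the attribute always has the form background-image:url(<URL>); with exactly one 'url(' fragment.

# Style value taken from the question's snippet
style = ('background-image:url(https://res.cloudinary.com/dubizzle-com/image/upload/'
         'co_rgb:242424,l_text:oswald_140_bold_letter_spacing_5:2292,y_50/'
         'co_rgb:FFFFFF,l_text:oswald_100_bold_letter_spacing_5:01,y_-107/'
         'c_fit,w_200/abu-dhabi-plate_private-car_classic);')

# Take everything between 'url(' and ');'
print(style.split('url(')[1].split(');')[0])

Compared with the regex in the other answer, the split relies on the exact 'url(' and ');' delimiters, so it will need adjusting if the site ever quotes the URL inside url('...').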
Upvotes: 1