Reputation: 101
I am trying to scrape used car listing prices and names, excluding those posted by a dealership. I would like to put the results in a DataFrame using pandas, but I can only do so once I can extract the right information. Here is the code.
from bs4 import BeautifulSoup as bs4
import requests
import csv
import pandas as pd
import numpy as np

pages_to_scrape=2
pages=[]
prices=[]
names=[]

for i in range(1,pages_to_scrape+1):
    url = 'https://www.kijiji.ca/b-cars-trucks/ottawa/used/page-{}/c174l1700185a49'.format(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs4(page.text,'html.parser')
    for k in soup.findAll('div', class_='price'):
        if k.find(class_='dealer-logo'):
            continue
        else:
            price=k.getText()
            prices.append(price.strip())
My code up to here works as intended, since 'dealer-logo' is a child of 'price'. However, I am having trouble making this work for the names, because the 'title' class is inside 'info-container', where 'price' is also found.
As such, abc=soup.find('a', { 'class' : 'title' }) returns only the first element on the page, whereas I want to iterate through every listing that does not have 'dealer-logo' in it. findAll obviously wouldn't work either, as it would give every element, and findNext gives me a NoneType.
for l in soup.findAll('div', class_='info-container'):
    if l.findAll(class_='dealer-logo'):
        continue
    else:
        abc=soup.find('a', { 'class' : 'title' })
        name=abc.getText()
        names.append(name.strip())

print(names)
print(prices)
Below is the code I am scraping. I want to ignore all instances where 'dealer-logo' is present, and get the price and title for each remaining listing and add them to a list.
Upvotes: 0
Views: 974
Reputation: 84465
With bs4 4.7.1+ you can use :not and :has to filter out the logo items. Select the parent node, then select the two target child items inside a comprehension, grouping them as tuples, and convert the result to a DataFrame with pandas.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://www.kijiji.ca/b-cars-trucks/ottawa/used/c174l1700185a49')
soup = bs(r.content, 'lxml')
df = pd.DataFrame(
    (i.select_one('a.title').text.strip(), i.select_one('.price').text.strip())
    for i in soup.select('.info-container:not(:has(.dealer-logo))')
    if 'wanted' not in i.select_one('a.title').text.lower()
)
df
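To try the selector without hitting the site, here is a minimal offline sketch. The HTML snippet is invented for illustration: only the class names (info-container, title, price, dealer-logo) come from the question, and the real page markup may differ. extract_listings is a hypothetical helper name, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from the question.
SAMPLE_HTML = """
<div class="info-container">
  <div class="price">$5,000.00</div>
  <a class="title" href="#">2010 Honda Civic</a>
</div>
<div class="info-container">
  <div class="price">$12,500.00<div class="dealer-logo"></div></div>
  <a class="title" href="#">2015 Ford Escape</a>
</div>
<div class="info-container">
  <div class="price">Please Contact</div>
  <a class="title" href="#">Wanted: old truck</a>
</div>
"""


def extract_listings(html):
    """Return (title, price) tuples for non-dealer, non-'wanted' listings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # :has(.dealer-logo) matches containers with a dealer logo anywhere inside;
    # wrapping it in :not(...) keeps only the containers without one.
    for card in soup.select(".info-container:not(:has(.dealer-logo))"):
        title = card.select_one("a.title").text.strip()
        price = card.select_one(".price").text.strip()
        if "wanted" in title.lower():
            continue  # skip "wanted" ads, as in the answer's if-clause
        rows.append((title, price))
    return rows


listings = extract_listings(SAMPLE_HTML)
print(listings)  # [('2010 Honda Civic', '$5,000.00')]
```

The resulting list can be passed to pd.DataFrame(listings, columns=['name', 'price']) if named columns are wanted.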
N.B.
It seems that at times one gets slightly more results than are shown on the page.
I think you can likely also filter out the wanted ads in the CSS selector, rather than with the if as above, using
df = pd.DataFrame(
    (i.select_one('a.title').text.strip(), i.select_one('.price').text.strip())
    for i in soup.select('div:not(div.regular-ad) > .info-container:not(:has(.dealer-logo))')
)
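If an older bs4 without :has support is in use, the loop from the question can probably be kept with one change: call find on each container (l in the question) rather than on soup, because soup.find always returns the first match in the whole page. A sketch of that idea follows. The sample HTML is made up (only the class names come from the question), and names_and_prices is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from the question.
SAMPLE_HTML = """
<div class="info-container">
  <div class="price">$3,200.00</div>
  <a class="title" href="#">2008 Toyota Corolla</a>
</div>
<div class="info-container">
  <div class="price">$9,900.00<div class="dealer-logo"></div></div>
  <a class="title" href="#">2014 Mazda 3</a>
</div>
<div class="info-container">
  <div class="price">$7,000.00</div>
  <a class="title" href="#">2012 Subaru Impreza</a>
</div>
"""


def names_and_prices(html):
    """Collect titles and prices from containers that have no dealer logo."""
    soup = BeautifulSoup(html, "html.parser")
    names, prices = [], []
    for container in soup.find_all("div", class_="info-container"):
        if container.find(class_="dealer-logo"):
            continue  # dealer listing, skip it
        # Search inside this container only, not the whole soup.
        title = container.find("a", class_="title")
        price = container.find("div", class_="price")
        if title is None or price is None:
            continue  # incomplete listing, nothing to record
        names.append(title.get_text(strip=True))
        prices.append(price.get_text(strip=True))
    return names, prices


names, prices = names_and_prices(SAMPLE_HTML)
print(names)
print(prices)
```

Because both values are read from the same container, names and prices stay aligned, which makes it safe to combine them into a DataFrame afterwards.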
Upvotes: 2