BeautifulSoup get text from child element within container

Question

I am trying to scrape used car listing prices and names, excluding those posted by a dealership. I would like to put the results in a DataFrame using pandas, but I can only do so once I can extract the right information. Here is the code.

from bs4 import BeautifulSoup as bs4
import requests
import csv
import pandas as pd
import numpy as np


pages_to_scrape=2
pages=[]
prices=[]
names=[]

for i in range(1,pages_to_scrape+1):
  url = 'https://www.kijiji.ca/b-cars-trucks/ottawa/used/page-{}/c174l1700185a49'.format(i)
  pages.append(url)

for item in pages:
  page = requests.get(item)
  soup = bs4(page.text,'html.parser')
  for k in soup.findAll('div', class_='price'):
    if k.find(class_='dealer-logo'):
      continue
    else: 
      price=k.getText()
      prices.append(price.strip())

My code up to here works as intended, since 'dealer-logo' is a child of 'price'. However, I am having trouble making this work for the names, as the 'title' class is within 'info-container', where 'price' is also found.

As such, abc=soup.find('a', { 'class' : 'title' }) returns only the first element of the page when I want it to iterate through every listing that does not have 'dealer-logo' in it, and findAll obviously wouldn't work as it would give every element. findNext gives me a NoneType.

  for l in soup.findAll('div', class_='info-container'):
    if l.findAll(class_='dealer-logo'):
      continue
    else:
      abc=soup.find('a', { 'class' : 'title' })
      name=abc.getText()
      names.append(name.strip())

print(names)
print(prices)

Below is the HTML I am scraping. I want to skip every listing where 'dealer-logo' is present and, for the remaining listings, get the price and title and add them to a list.

QHarr · Accepted Answer

With bs4 4.7.1+ you can use the :not and :has CSS pseudo-classes to filter out the listings that contain a dealer logo. Select the parent node (.info-container), then select the two target child elements inside a generator expression, group them as tuples, and convert the result to a DataFrame with pandas.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://www.kijiji.ca/b-cars-trucks/ottawa/used/c174l1700185a49')
soup = bs(r.content, 'lxml')
df = pd.DataFrame(
    (i.select_one('a.title').text.strip(), i.select_one('.price').text.strip())
    for i in soup.select('.info-container:not(:has(.dealer-logo))')
    if 'wanted' not in i.select_one('a.title').text.lower()
)
df
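
To see what the selector does without hitting the live site, here is a minimal sketch run against a small hand-written HTML sample. The markup is an assumption that only mimics the structure described in the question ('price' and 'title' inside 'info-container', 'dealer-logo' nested inside 'price'); the real page may differ. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the question;
# the listing names and prices are made up for illustration.
html = """
<div class="info-container">
  <div class="price">$5,000</div>
  <a class="title" href="#">2010 Honda Civic</a>
</div>
<div class="info-container">
  <div class="price">$12,000<div class="dealer-logo"></div></div>
  <a class="title" href="#">2015 Ford Focus</a>
</div>
<div class="info-container">
  <div class="price">$3,500</div>
  <a class="title" href="#">2008 Toyota Corolla</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only containers that have no .dealer-logo descendant,
# then read the title and price from inside each container.
rows = [
    (c.select_one("a.title").text.strip(), c.select_one(".price").text.strip())
    for c in soup.select(".info-container:not(:has(.dealer-logo))")
]
print(rows)
```

The second container is dropped because :has(.dealer-logo) matches any descendant, not only direct children.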

N.B.

At times this seems to return slightly more results than are visible on the page.

I think you can likely also filter out the wanted ads in the CSS selector, rather than with the if as above, using

df = pd.DataFrame(
    (i.select_one('a.title').text.strip(), i.select_one('.price').text.strip())
    for i in soup.select('div:not(div.regular-ad) > .info-container:not(:has(.dealer-logo))')
)
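
If :has is not available (bs4 older than 4.7.1), the loop from the question can probably be kept with one change: search inside each container instead of the whole soup. Calling soup.find('a', ...) inside the loop always returns the first title on the page, whereas calling find on the loop variable only looks at that listing. The sketch below assumes the same hypothetical markup structure as described in the question, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
html = """
<div class="info-container">
  <div class="price">$5,000</div>
  <a class="title" href="#">2010 Honda Civic</a>
</div>
<div class="info-container">
  <div class="price">$12,000<div class="dealer-logo"></div></div>
  <a class="title" href="#">2015 Ford Focus</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
names, prices = [], []

for container in soup.find_all("div", class_="info-container"):
    # Skip listings that contain a dealer logo anywhere inside them.
    if container.find(class_="dealer-logo"):
        continue
    # Search within this container only, not the whole document.
    title = container.find("a", class_="title")
    price = container.find("div", class_="price")
    if title is None or price is None:
        continue  # guard against listings missing either element
    names.append(title.get_text(strip=True))
    prices.append(price.get_text(strip=True))

print(names, prices)
```

Collecting both values in the same pass also keeps names and prices aligned, which matters when they are later combined into a DataFrame.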
