Frank Harb

Reputation: 79

BeautifulSoup: parsing data under a specific tag

Right now I am parsing a web page with this code:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.gumtree.com.au/s-surfing/mona-vale-sydney/surfboard/k0c18568l3003999r10?fromSearchBox=true").text
soup = BeautifulSoup(page, "html.parser")
boards = soup(itemprop="name")
prices = soup("span", {"class": "price-currency"})

for board, price in zip(boards, prices):
    print(board.text.strip(), price.next_sibling)

And it prints the board and the price like this:

SURFBOARD RACK free delivery to your door 120.00
Huge Beginner Surfboard Sale! Kids & Adult Softboards all 1/2 Price!! 90.00
Mega Softboard Clearance Sale! Beginner Foam SurfBoards 1/2 Price! 90.00
Surfboard 6'2" Simon Anderson Spudnick 360.00
Surfboard Cover, Surfboard Bags, Cheap Single Surf Board Bags 50.00

The web page that I am parsing is split into 3 sections: sponsored links, top ads, and recent ads. I am currently printing data from all 3 sections, but I want data only from the recent ads section, whose container has this HTML:

<div class="module__body ad-listing">

How do I specify that I only want the boards and prices printed from beneath this section?

Page: https://www.gumtree.com.au/s-surfing/mona-vale-sydney/surfboard/k0c18568l3003999r10?fromSearchBox=true

Upvotes: 3

Views: 63

Answers (1)

Bill Bell

Reputation: 21643

You may detest this answer. When I see complicated HTML like that, my inclination is to use the lxml module, because then I can use XPath expressions.

In this case the first XPath expression collects the li elements you want, namely the ones under the div with class module__body ad-listing. The loop then applies two more XPath expressions to each li: one that finds the name text, such as "Quicksale 6'4 Dylan Surfboard RX5", and one that finds the collection of text fragments making up the price information within the same element. Item 12 seems to be coded differently; I haven't investigated that.

>>> import requests
>>> from lxml import etree
>>> page = requests.get('https://www.gumtree.com.au/s-surfing/mona-vale-sydney/surfboard/k0c18568l3003999r10?fromSearchBox=true').text
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(page, parser=parser)
>>> recents = tree.xpath('.//div[@class="module__body ad-listing"]/ul/li')
>>> for i, recent in enumerate(recents):
...     try:
...         i, recent.xpath('.//span[@itemprop="name"]/text()')[0].strip()
...     except IndexError:
...         '-------------> item', i, 'failed'
...         continue
...     one_span = recent.xpath('.//span[@class="j-original-price"]')[0]
...     ' '.join([_.strip() for _ in list(one_span.itertext()) if _.strip()])
... 
(0, "Quicksale 6'4 Dylan Surfboard RX5")
'$ 450.00 Negotiable'
(1, 'DHD 5\'9 "Switchblade" Surfboard')
'$ 450.00 Negotiable'
(2, '6ft Modern Surfboards - Highline')
'$ 450.00 Negotiable'
(3, "5'11 Channel Island T-Low surfboard")
'$ 450.00 Negotiable'
(4, 'Chill Rare Bird Surfboard 5"8')
'$ 450.00 Negotiable'
(5, 'Vintage surfboard')
'$ 450.00 Negotiable'
(6, "5'7 Annesley Blonde model")
'$ 450.00 Negotiable'
(7, 'McCoy single fin surfboard')
'$ 450.00 Negotiable'
(8, 'Sculpt surfboard')
'$ 450.00 Negotiable'
(9, '8\'1" longboard surfboard travel cover')
'$ 450.00 Negotiable'
(10, 'Longboard Surfboard')
'$ 450.00 Negotiable'
(11, "5'10 Custom Chaos Surfboard")
'$ 450.00 Negotiable'
('-------------> item', 12, 'failed')
(13, "6'0 JS lowdown")
'$ 450.00 Negotiable'
(14, 'Mega Softboard Clearance Sale! Beginner Foam SurfBoards 1/2 Price!')
'$ 450.00 Negotiable'
(15, 'Surfboard')
'$ 450.00 Negotiable'
(16, 'Surfboard 5\'10" 30 lt')
'$ 450.00 Negotiable'
(17, 'Christenson Super Sport Surfboard')
'$ 450.00 Negotiable'
(18, 'TOMO Firewire V4 Surfboard')
'$ 450.00 Negotiable'
(19, "Surfboard 6'6 baked bean")
'$ 450.00 Negotiable'
(20, 'foam surfboards')
'$ 450.00 Negotiable'
(21, 'Channel Islands surfboard')
'$ 450.00 Negotiable'
(22, 'Channel Islands Surfboard')
'$ 450.00 Negotiable'
(23, 'JS surfboard')
'$ 450.00 Negotiable'
(24, 'CLASSIC RETRO SURF FACTORY MINI MAL')
'$ 450.00 Negotiable'
(25, 'Surfboard JS')
'$ 450.00 Negotiable'
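
For completeness, the same scoping can be done in BeautifulSoup itself by selecting the recent-ads container first and then searching only inside it. This is a minimal sketch, assuming the div with class module__body ad-listing is the only such container on the page and that the markup inside it matches what your original loop expects:

import requests
from bs4 import BeautifulSoup

url = 'https://www.gumtree.com.au/s-surfing/mona-vale-sydney/surfboard/k0c18568l3003999r10?fromSearchBox=true'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Limit the search to the recent-ads container (assumed unique on the page).
recent_ads = soup.select_one('div.module__body.ad-listing')
if recent_ads is not None:
    boards = recent_ads.find_all(itemprop='name')
    prices = recent_ads.find_all('span', class_='price-currency')
    for board, price in zip(boards, prices):
        print(board.text.strip(), price.next_sibling)

The idea is the same as the XPath version above: restrict the search to the recent-ads div first, then look for names and prices only within it.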

Upvotes: 1
