Reputation: 306
I want to scrape a page and end up with two lists:
ListA = ["Driver Convenience","Exterior Features"]
ListB = ["2 key fob;Collision mitigation braking system;","Body coloured plastic front bumper;Boulder grey exterior door handle;Boulder grey exterior door mirrorn;"]
ListA will contain the text of each h4 tag, and ListB will contain the text of the li tags that follow it, up to the next h4 tag.
Here is a sample of the HTML:
<ul class="c-list-table">
<h4 class="c-list-table__section-heading">Driver Convenience</h4>
<li class="c-list-table__item" rel="2-key-fob"><span class="c-list-table__title"> 2 key fob </span></li>
<li class="c-list-table__item" rel="collision-mitigation-braking-system">Collision mitigation braking system</li>
<h4 class="c-list-table__section-heading">Exterior Features</h4>
<li class="c-list-table__item" rel="body-coloured-plastic-front-bumper">Body coloured plastic front bumper</li>
<li class="c-list-table__item" rel="boulder-grey-exterior-door-handle">Boulder grey exterior door handle</li>
<li class="c-list-table__item" rel="boulder-grey-exterior-door-mirror">Boulder grey exterior door mirrorn</li>
</ul>
The real HTML has the same structure as this sample. I have tried many things but could not get it to work.
Upvotes: 1
Views: 95
Reputation: 33384
Use find_next_siblings('li') to get the li tags that follow each h4. For each of those li tags, check the text of find_previous_sibling('h4'): if it matches the current heading, the li belongs to that heading, so add its text to the list.
from bs4 import BeautifulSoup
data = '''
<ul class="c-list-table">
<h4 class="c-list-table__section-heading">Driver Convenience</h4>
<li class="c-list-table__item" rel="2-key-fob"><span class="c-list-table__title"> 2 key fob </span></li>
<li class="c-list-table__item" rel="collision-mitigation-braking-system">Collision mitigation braking system</li>
<h4 class="c-list-table__section-heading">Exterior Features</h4>
<li class="c-list-table__item" rel="body-coloured-plastic-front-bumper">Body coloured plastic front bumper</li>
<li class="c-list-table__item" rel="boulder-grey-exterior-door-handle">Boulder grey exterior door handle</li>
<li class="c-list-table__item" rel="boulder-grey-exterior-door-mirror">Boulder grey exterior door mirrorn</li>
</ul>'''
ListA = []
ListB = []
soup = BeautifulSoup(data, 'lxml')
for item in soup.find_all('h4'):
    lifinal = ""
    ListA.append(item.text)
    # Every li sibling after this h4, including those under later headings
    nextlis = item.find_next_siblings('li')
    for li in nextlis:
        # Keep only the li tags whose nearest preceding h4 has this heading's text
        if li.find_previous_sibling('h4').text == item.text:
            lifinal = lifinal + li.text.strip() + ";"
    ListB.append(lifinal)
print(ListA)
print(ListB)
Output:
['Driver Convenience', 'Exterior Features']
['2 key fob;Collision mitigation braking system;', 'Body coloured plastic front bumper;Boulder grey exterior door handle;Boulder grey exterior door mirrorn;']
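The same grouping can also be done in a single pass over the direct children of the ul, which avoids re-scanning the siblings for every heading. The following is only a sketch under a few assumptions: the function name group_items_by_heading is made up for illustration, the sample HTML is trimmed, and it uses the built-in html.parser rather than lxml.

```python
from bs4 import BeautifulSoup

# Trimmed sample with the same structure as the question's HTML (assumption).
html = """
<ul class="c-list-table">
<h4 class="c-list-table__section-heading">Driver Convenience</h4>
<li class="c-list-table__item"><span class="c-list-table__title"> 2 key fob </span></li>
<li class="c-list-table__item">Collision mitigation braking system</li>
<h4 class="c-list-table__section-heading">Exterior Features</h4>
<li class="c-list-table__item">Body coloured plastic front bumper</li>
</ul>
"""


def group_items_by_heading(markup):
    """Hypothetical helper: return (headings, joined item strings)."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("ul", class_="c-list-table")
    headings, groups = [], []
    # Walk only the direct h4/li children, in document order.
    for tag in table.find_all(["h4", "li"], recursive=False):
        if tag.name == "h4":
            # A new heading starts a new, empty group.
            headings.append(tag.get_text(strip=True))
            groups.append([])
        elif groups:
            # An li belongs to the most recent heading seen so far.
            groups[-1].append(tag.get_text(strip=True))
    # Join each group in the "item;item;" format the question asks for.
    joined = ["".join(text + ";" for text in group) for group in groups]
    return headings, joined


list_a, list_b = group_items_by_heading(html)
print(list_a)
print(list_b)
```

Because each li is attached to the most recent h4 seen so far, two headings with identical text would still get separate groups. Any li that appears before the first h4 is skipped.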
Upvotes: 1