Reputation: 15152
I have an unordered list like this in HTML:
<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
<li>3. stop lights</li>
<li>Bluetooth</li>
</ul>
Only the first li element in the ul list contains the title of the list; the other elements contain the features that need to be extracted as plain text.
I know how to locate that first li, but I don't know how to select all the other elements.
Note that this ul doesn't have a class, and it's in an HTML document with a lot of other ul elements.
I can locate that ul through the li with:
(li.previousSibling).get_text()
but I cannot extract all the elements with get_text(). I'm getting:
AttributeError: 'NavigableString' object has no attribute 'get_text'
I also need to extract every li except the first one, which holds the title. I have several ul elements like this on the page, and they vary in length (they have more or fewer li elements).
EDIT
My code so far. I'm finding elements with:
carBasics = soup.select('li.label')
for li in carBasics:
    if li.contents[0]=="Equipement":
        carAdditionalEquipement = (li.previousSibling).find_all('li')
AttributeError: 'NavigableString' object has no attribute 'get_text'
Upvotes: 0
Views: 1348
Reputation: 15152
The idea is to omit the first li.
No one gave an answer to that, so this is how I did it in the end:
for li in soup.select("ul li.labela"):
    if li.text=="Equipement":
        carAdditionalEquipement = li.parent.text[len(li.contents[0])+1:].strip().splitlines()
From that I get a clean list without the first line, which is cut off by [len(li.contents[0])+1:].
Basically, I chop the length of the first element off the text of the whole list and then split the rest into lines, since there is a newline character at the end of each list item.
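A minimal, self-contained sketch of that slicing approach, run against the sample list from the question rather than the live page (the html string and the label class come from the question; the features variable name is only illustrative). It assumes every li sits on its own line in the source; on minified HTML without newlines the items would not be separated.

```python
from bs4 import BeautifulSoup

# Sample markup from the question; the real page uses a different class name.
html = """
<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
<li>3. stop lights</li>
<li>Bluetooth</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

features = []
for li in soup.select("ul li.label"):
    if li.text == "Equipement":
        # Text of the whole <ul>: a leading newline, then one item per line.
        full_text = li.parent.text
        # Skip the leading newline plus the title, then split what is left.
        features = full_text[len(li.contents[0]) + 1:].strip().splitlines()

print(features)
```

Because the result depends on the whitespace between the tags, selecting the sibling li elements directly is usually the more robust choice.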
Upvotes: 0
Reputation: 84455
Use the CSS general sibling combinator (~). With bs4 4.7.1+ you can also use :contains to match the label text, if it is known:
from bs4 import BeautifulSoup as bs
html = '''
<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
<li>3. stop lights</li>
<li>Bluetooth</li>
</ul>
'''
soup = bs(html, 'lxml')
print([li.text for li in soup.select('.label:contains("Equipement") ~ li')])
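In more recent releases of Soup Sieve (the selector library behind select), :contains is deprecated in favour of :-soup-contains. A sketch of the same selector with the newer spelling, assuming Soup Sieve 2.1 or later is installed; the second "Safety" list is a hypothetical addition to show that only the matching list is picked up.

```python
from bs4 import BeautifulSoup

html = """
<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
<li>3. stop lights</li>
<li>Bluetooth</li>
</ul>
<ul>
<li class="label">Safety</li>
<li>ABS</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "~ li" matches every later <li> sibling of the label that contains the text.
selector = '.label:-soup-contains("Equipement") ~ li'
features = [li.get_text(strip=True) for li in soup.select(selector)]

print(features)
```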
Upvotes: 1
Reputation: 33384
Use find_next_siblings():
from bs4 import BeautifulSoup
html='''<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
<li>3. stop lights</li>
<li>Bluetooth</li>
</ul>
<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
<li>3. stop lights</li>
<li>Bluetooth</li>
</ul>'''
soup = BeautifulSoup(html, 'lxml')
for item in soup.select("ul li.label"):
    if item.text=="Equipement":
        siblings=[s.text for s in item.find_next_siblings('li')]
        print(siblings)
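As for the error in the question, a likely cause is that previousSibling on the first li does not return the ul at all. In markup laid out like the sample, it returns the whitespace text node between the opening ul tag and the first li, which is a NavigableString and not a Tag. The sketch below illustrates this on a shortened copy of the sample list (the variable names are illustrative only):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
label = soup.select_one("li.label")

# The node right before the first <li> is the newline after <ul>, not a tag.
prev = label.previous_sibling
is_text_node = isinstance(prev, NavigableString)

# The enclosing list is reached through .parent instead.
parent_name = label.parent.name

# next_siblings also yields whitespace nodes, so keep only real tags.
feature_tags = [s for s in label.next_siblings if isinstance(s, Tag)]
feature_names = [t.get_text(strip=True) for t in feature_tags]

print(is_text_node, parent_name, feature_names)
```

find_next_siblings('li') does this tag filtering for you, which is why it avoids the problem.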
Edited the answer:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.index.hr/oglasi/bmw-serija-5-3-0-xd/oid/1971034')
soup = BeautifulSoup(html.content, 'html.parser')
for item in soup.select("ul li.labela"):
    if item.text=="Dodatna oprema vozila":
        siblings=[s.text for s in item.find_next_siblings('li')]
        print(siblings)
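Since the question mentions several lists of varying length on the same page, the same call can collect all of them in one pass. A rough sketch, assuming every list of interest starts with an li carrying the label class; the "Safety" and unlabelled lists, and the features_by_label name, are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
</ul>
<ul>
<li class="label">Safety</li>
<li>ABS</li>
</ul>
<ul>
<li>Unrelated item</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each list title to the text of the <li> elements that follow it.
features_by_label = {}
for label in soup.select("ul li.label"):
    title = label.get_text(strip=True)
    features_by_label[title] = [
        s.get_text(strip=True) for s in label.find_next_siblings("li")
    ]

print(features_by_label)
```

Lists without a label item are skipped, and each list may have any number of features.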
Upvotes: 1
Reputation: 11505
from bs4 import BeautifulSoup
import requests
html = requests.get('yoururl')
soup = BeautifulSoup(html.content, 'html.parser')
for li in soup.select('ul li.labela'):
    if li.contents[0]=="Equipement":
        print(li.parent.text)
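Note that li.parent.text includes the title line as well. If only the features are wanted, one option is to collect the direct li children of the parent and slice off the first one. A sketch on the sample markup from the question (it uses the label class from the question, not labela, and the features name is illustrative):

```python
from bs4 import BeautifulSoup

html = """
<ul>
<li class="label">Equipement</li>
<li>Aluminum tyres</li>
<li>4x4</li>
<li>3. stop lights</li>
<li>Bluetooth</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

features = []
for li in soup.select("ul li.label"):
    if li.get_text(strip=True) == "Equipement":
        # Direct <li> children of the enclosing <ul>, title included.
        items = li.parent.find_all("li", recursive=False)
        # Skip the first item, which holds the title.
        features = [item.get_text(strip=True) for item in items[1:]]

print(features)
```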
Upvotes: 1