Reputation: 699
I would like to extract data from a website, and I need to know whether it lists certain pieces of equipment. In the example below, A has CD but does not have CDA.
HTML:
<div class="ABC">
<h3>A</h3>
<ul>
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
</ul>
<h3>B</h3>
<div class="buyCarDetailContentSpecContent ">
<ul>
<li>
<p>b1<span>1</span></p>
</li>
<li>
<p>b2<span>2</span></p>
</li>
</ul>
</div>
</div>
My code:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.acd.com/carinfo-4434.php')
soup = BeautifulSoup(res.text, 'lxml')
for item in soup.find_all(attrs={'class': 'ABC'}):
    for link in item.find_all('li'):
        print(link)
My code extracts every li from the HTML, like this:
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
<li>
<p>b1<span>1</span></p>
</li>
<li>
<p>b2<span>2</span></p>
</li>
But that's not what I want. What I want is to extract each li's class and its text, so that the result looks like this:
specChecked, CD
specChecked, VCD
, CDA
(Or maybe I could just replace specChecked with 1 and an empty class with 0.)
Upvotes: 0
Views: 3023
Reputation: 22440
You can do something like the following to get the content of the desired class along with the empty one.
from bs4 import BeautifulSoup
content = """
<div class="ABC">
<h3>A</h3>
<ul>
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
</ul>
<h3>B</h3>
<div class="buyCarDetailContentSpecContent ">
<ul>
<li>
<p>b1<span>1</span></p>
</li>
<li>
<p>b2<span>2</span></p>
</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")
for item in soup.find_all('li', class_=["specChecked", ""]):
    print("{}, {}".format(' '.join(item['class']), item.text))
Output:
specChecked, CD
specChecked, VCD
, CDA
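If you would rather have the 1/0 form mentioned in the question, here is a minimal sketch. It assumes the equipment list is the first ul inside the div with class ABC; the names equipment_flags and html are made up for this illustration and are not part of the original page or code.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (hypothetical variable name).
html = """
<div class="ABC">
<h3>A</h3>
<ul>
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
</ul>
<h3>B</h3>
<div class="buyCarDetailContentSpecContent ">
<ul>
<li><p>b1<span>1</span></p></li>
<li><p>b2<span>2</span></p></li>
</ul>
</div>
</div>
"""


def equipment_flags(markup):
    """Map each equipment name to 1 if its li has specChecked, else 0."""
    soup = BeautifulSoup(markup, "html.parser")
    flags = {}
    container = soup.find("div", class_="ABC")
    # Assumption: the first ul inside the container is the equipment list.
    first_list = container.find("ul") if container else None
    if first_list is None:
        return flags
    for li in first_list.find_all("li"):
        # A missing or empty class attribute is treated as "not checked".
        classes = li.get("class") or []
        flags[li.get_text(strip=True)] = 1 if "specChecked" in classes else 0
    return flags


print(equipment_flags(html))
```

With the sample markup this prints {'CD': 1, 'VCD': 1, 'CDA': 0}. Restricting the search to the first ul keeps the b1/b2 items from section B out of the result.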
Upvotes: 3
Reputation: 82765
s = """<div class="ABC">
<h3>A</h3>
<ul>
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
</ul>
<h3>B</h3>
<div class="buyCarDetailContentSpecContent ">
<ul>
<li>
<p>b1<span>1</span></p>
</li>
<li>
<p>b2<span>2</span></p>
</li>
</ul>
</div>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")
for link in soup.find_all('li'):
    if link.has_attr("class"):
        print(link.get("class", ""), link.text)
Output:
[u'specChecked'], u'CD'
[u'specChecked'], u'VCD'
[u''], u'CDA'
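If you want plain strings rather than the list form shown in the output, one possible variant is to join the class values before printing. This is only a sketch: class_and_text and sample are hypothetical names used for illustration, and the sample markup is a shortened copy of the HTML from the question.

```python
from bs4 import BeautifulSoup

# Shortened sample: three li elements with a class attribute, one without.
sample = """
<ul>
<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li>
<li><p>b1<span>1</span></p></li>
</ul>
"""


def class_and_text(markup):
    """Return (class string, text) pairs for every li that has a class attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for li in soup.find_all("li"):
        if not li.has_attr("class"):
            continue  # skip li elements with no class attribute at all
        # Beautiful Soup stores class as a list, so join it into one string.
        rows.append((" ".join(li["class"]), li.get_text(strip=True)))
    return rows


for cls, text in class_and_text(sample):
    print("{}, {}".format(cls, text))
```

An li with class="" yields an empty string for the class, while an li with no class attribute is skipped entirely.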
has_attr: to check if the li has a class attribute
link.get: to get the class value
link.text: to extract the text
Upvotes: 2