Lara19

Reputation: 699

Python: extract class and text

I would like to extract data from a website and find out whether each item has certain equipment. In the example below, A has CD but does not have CDA.

HTML:

<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>

My code:

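import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.acd.com/carinfo-4434.php')
soup = BeautifulSoup(res.text, 'lxml')
# Prints every <li> inside the div with class "ABC", including the ones under B
for item in soup.find_all(attrs={'class': 'ABC'}):
    for link in item.find_all('li'):
        print(link)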

My code extracts every li element from the HTML, like this:

<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li> 
<li>
    <p>b1<span>1</span></p>
</li>
<li>
    <p>b2<span>2</span></p>
</li>

But that's not what I want. I want to extract each li's class together with its text, so that the result looks like this:

specChecked, CD
specChecked, VCD
, CDA

(Or maybe I could just replace specChecked with 1 and the empty class with 0.)

Upvotes: 0

Views: 3023

Answers (2)

SIM

Reputation: 22440

You can do something like the following to get the li elements with the desired class along with the ones whose class is empty.

from bs4 import BeautifulSoup

content = """
<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")
for item in soup.find_all('li',class_=["specChecked",""]):
    print("{}, {}".format(' '.join(item['class']),item.text))

Output:

specChecked, CD
specChecked, VCD
, CDA
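
For the 1/0 encoding mentioned in the question, a minimal sketch could look like the one below. It assumes the equipment list is the first ul inside the div with class ABC, which holds for the sample HTML but may not hold on the real page. The names equipment_flags and html are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
            <li><p>b1<span>1</span></p></li>
            <li><p>b2<span>2</span></p></li>
        </ul>
    </div>
</div>
"""


def equipment_flags(markup):
    """Map each equipment name to 1 (specChecked) or 0 (anything else)."""
    soup = BeautifulSoup(markup, "html.parser")
    # Assumption: the equipment list is the first <ul> inside div.ABC
    container = soup.find("div", class_="ABC")
    first_list = container.find("ul")
    flags = {}
    for li in first_list.find_all("li"):
        name = li.get_text(strip=True)
        # li.get("class", []) is a list of class names; empty or missing means 0
        flags[name] = 1 if "specChecked" in li.get("class", []) else 0
    return flags


print(equipment_flags(html))
# {'CD': 1, 'VCD': 1, 'CDA': 0}
```

Restricting the search to one ul keeps the b1/b2 items under B out of the result without depending on how an empty class attribute is matched.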

Upvotes: 3

Rakesh

Reputation: 82765

s = """<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>"""
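from bs4 import BeautifulSoup

soup = BeautifulSoup(s, "html.parser")
for link in soup.find_all('li'):
    # Skip <li> elements that have no class attribute at all
    if link.has_attr("class"):
        # get("class") returns the classes as a list; .text is the element's text
        print(link.get("class", ""), link.text)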

Output:

[u'specChecked'], u'CD'
[u'specChecked'], u'VCD'
[u''], u'CDA'

  • Use has_attr to check whether the li has a class attribute.
  • Use link.get to get the class value.
  • Use link.text to extract the text.
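
To get the plain "class, text" lines from the question with the same has_attr approach, the class list can be joined into a string before printing. This is only a sketch against a shortened copy of the sample HTML; class_and_text and sample are illustrative names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

sample = """
<div class="ABC">
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>
    </ul>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
            <li><p>b1<span>1</span></p></li>
        </ul>
    </div>
</div>
"""


def class_and_text(markup):
    """Return 'class, text' strings for every <li> that has a class attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for li in soup.find_all("li"):
        # <li> tags without any class attribute (the b1 item) are skipped
        if li.has_attr("class"):
            # Joining the class list yields '' for class=""
            classes = " ".join(li.get("class", []))
            rows.append("{}, {}".format(classes, li.get_text(strip=True)))
    return rows


for row in class_and_text(sample):
    print(row)
```

For the sample this should print the three lines the question asks for, with an empty string before the comma for CDA.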

Upvotes: 2
