Unknown Shin

Reputation: 119

How can I scrape the text from a <p> without the unwanted link text in this structure?

The website's HTML looks like this:

<ul class="article-list">
  <li>
    <p class="promo">
      "sentence sentence sentence sentence"
      <a class="readmore" href="https://link.blahblah.com"> Read more >> </a>
    </p>
  </li>
</ul>

I tried:

ul = soup.find_all("ul", class_="article-list")
for elem in ul:
    lis = elem.find_all("li")
    for x in lis:
        preview = x.find("p", class_="promo").get_text()

This returns:

"sentence sentence sentence sentence     Read more"

How can I return "sentence sentence sentence sentence" only without "Read more"?

Upvotes: 0

Views: 74

Answers (3)

NotSoary

Reputation: 11

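You could try collecting the results in a list:

from bs4 import BeautifulSoup as bs

soup = bs(resp, 'html.parser')  # resp: the HTML of the page, fetched elsewhere

ul = soup.find_all("ul", class_="article-list")
previews = []  # one entry per <li>
for elem in ul:
    lis = elem.find_all("li")
    for x in lis:
        # use a separate name so the list is not overwritten by the tag
        p = x.find("p", class_="promo")
        # note: p.text still includes the text of the nested <a>
        previews.append(p.text)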

Upvotes: 0

Eren Han

Reputation: 321

I'm not sure, though:

preview = x.find("p", class_="promo").a.text
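Note that .a.text selects the text of the link itself, not the sentence before it. A minimal sketch of going the other way, assuming it is acceptable to modify the parsed tree: remove the <a> with decompose() and then call get_text(). The HTML string is copied from the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# HTML copied from the question
html = '''<ul class="article-list">
  <li>
    <p class="promo">
      "sentence sentence sentence sentence"
      <a class="readmore" href="https://link.blahblah.com"> Read more >> </a>
    </p>
  </li>
</ul>'''

soup = BeautifulSoup(html, "html.parser")

p = soup.find("p", class_="promo")
link = p.a

# .a.text is the text of the link, not of the paragraph
link_text = link.text

# Remove the link from the tree, then read what is left of the <p>
link.decompose()
preview = p.get_text(strip=True)

print(preview)
```

This changes the soup object, so it only fits if the link is not needed afterwards.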

Upvotes: 0

Andrej Kesely

Reputation: 195448

You can use the .find_next() method with the text=True parameter, which returns the first text node after the opening <p> tag:

data = '''<ul class="article-list">
<li>
<p class="promo">
"sentence sentence sentence sentence"
<a class="readmore" href="https://link.blahblah.com"> Read more >> </a>
</p>
</li>
</ul>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

print(soup.select_one('p.promo').find_next(text=True))

Prints:

"sentence sentence sentence sentence"
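To apply the same idea to every item in the list, something like the following might work. It is only a sketch: it assumes the preview is always the first text node inside each p.promo, uses the built-in 'html.parser' so that lxml is not required, and passes string=True, which is the newer name for text=True (Beautiful Soup 4.4.0 and later). The two-item HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative HTML with two list items (not from the real site)
html = '''<ul class="article-list">
<li>
<p class="promo">
"first sentence"
<a class="readmore" href="https://link.blahblah.com"> Read more >> </a>
</p>
</li>
<li>
<p class="promo">
"second sentence"
<a class="readmore" href="https://link.blahblah.com"> Read more >> </a>
</p>
</li>
</ul>'''

soup = BeautifulSoup(html, "html.parser")

previews = []
for p in soup.select("ul.article-list li p.promo"):
    # First text node after the opening <p>, i.e. the text before the <a>
    first_text = p.find_next(string=True)
    if first_text is not None:
        previews.append(first_text.strip())

print(previews)
```

The .strip() call removes the newlines around the text node. If a p.promo had no text before its link, the first text node found could be whitespace or the link text, so that case would need an extra check.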

Upvotes: 1
