hello11

Reputation: 105

How to scrape one <p> tag out of several with BeautifulSoup

The page has multiple <p> tags, but I want to scrape only one of them. The inspected HTML is below:

<div class="sidebar sbt">
 <h4>history</h4>
   <p class="top">
        <strong>First </strong><br>
              Jun 2017
   </p>
   <p class="top">
        <strong>Page </strong><br>
        Last 30 days: <strong>200</strong>        
   </p>
   <p class="top">
        <strong>Last </strong><br>
        2019
    </p>
        </div>

As shown above, there are several <p> tags. If I wanted to scrape just one of them, e.g. First Jun 2017, how would I do that with the soup.findAll(..) function?

Upvotes: 1

Views: 3357

Answers (4)

Humayun Ahmad Rajib

Reputation: 1560

You can try the following, which uses the soup.findAll(..) function and takes the first element of the list it returns:

from bs4 import BeautifulSoup

html_doc="""
<div class="sidebar sbt">
 <h4>history</h4>
   <p class="top">
        <strong>First </strong><br>
              Jun 2017
   </p>
   <p class="top">
        <strong>Page </strong><br>
        Last 30 days: <strong>200</strong>        
   </p>
   <p class="top">
        <strong>Last </strong><br>
        2019
    </p>
        </div>
"""
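soup = BeautifulSoup(html_doc, 'lxml')
# findAll returns a list of every <p>; index 0 is the first one
result = soup.findAll('p')[0].text
# collapse newlines and runs of whitespace into single spaces
print(" ".join(result.split()))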

Output will be:

First Jun 2017

Upvotes: 2

Bill Bell

Reputation: 21643

You seem to want to target p elements according to text. Here's one way of doing that.

The most significant line is the one that uses a regular expression to find 'Last 30 days', which is just part of the string in a p element. Having found this string you can find its parent and then display the entire text of that parent or other chunks of the parent.

Notice that since I've used find_all, the result is a list (because there could be more than one match), so I needed to choose the first item, element zero.

>>> import re
>>> import bs4
>>> HTML = open('temp.htm').read()
>>> for line in HTML.split('\n'):
...     print (line)
...     
<div class="sidebar sbt">
 <h4>history</h4>
   <p class="top">
        <strong>First </strong><br>
              Jun 2017
   </p>
   <p class="top">
        <strong>Page </strong><br>
        Last 30 days: <strong>200</strong>        
   </p>
   <p class="top">
        <strong>Last </strong><br>
        2019
    </p>
        </div>
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> target = soup.find_all(string=re.compile('Last 30 days'))
>>> target
['\n        Last 30 days: ']
>>> target[0].findParent()
<p class="top">
<strong>Page </strong><br/>
        Last 30 days: <strong>200</strong>
</p>
>>> target[0].findParent().text
'\nPage \n        Last 30 days: 200\n'

Upvotes: 0

Shashank

Reputation: 1135

soup.p returns the first <p> tag in the parsed HTML.

>>> from bs4 import BeautifulSoup
>>> htmlData = '''
... <div class="sidebar sbt">
...  <h4>history</h4>
...    <p class="top">
...         <strong>First </strong><br>
...               Jun 2017
...    </p>
...    <p class="top">
...         <strong>Page </strong><br>
...         Last 30 days: <strong>200</strong>        
...    </p>
...    <p class="top">
...         <strong>Last </strong><br>
...         2019
...     </p>
...         </div>
... '''
>>>
>>> soup = BeautifulSoup(htmlData, 'html.parser')
>>> soup.p
<p class="top">
<strong>First </strong><br>
              Jun 2017
   </br></p>
>>> 

To scrape the nth <p>, use the nth-of-type CSS selector, where n is the 1-based position among sibling <p> tags:

soup.select("p:nth-of-type(n)")

Example:

>>> soup.select("p:nth-of-type(3)")
[<p class="top">
<strong>Last </strong><br>
        2019
    </br></p>]
>>> soup.select("p:nth-of-type(2)")
[<p class="top">
<strong>Page </strong><br>
        Last 30 days: <strong>200</strong>
</br></p>]
>>> soup.select("p:nth-of-type(1)")
[<p class="top">
<strong>First </strong><br>
              Jun 2017
   </br></p>]
>>>

More about CSS selectors

Alternatively, you can find all the <p> tags and then iterate over them to find the desired one.
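
A minimal sketch of that iteration, assuming each <p> starts with a <strong> label as in the question's HTML. The dictionary name paragraphs_by_label is made up for illustration and is not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="sidebar sbt">
 <h4>history</h4>
   <p class="top">
        <strong>First </strong><br>
              Jun 2017
   </p>
   <p class="top">
        <strong>Page </strong><br>
        Last 30 days: <strong>200</strong>
   </p>
   <p class="top">
        <strong>Last </strong><br>
        2019
    </p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Hypothetical helper structure: map each paragraph's <strong> label
# to the paragraph's whitespace-normalised text.
paragraphs_by_label = {}
for p in soup.find_all("p", class_="top"):
    label_tag = p.find("strong")  # first <strong> inside this <p>
    if label_tag is None:
        continue  # skip paragraphs that have no label
    label = label_tag.get_text(strip=True)
    paragraphs_by_label[label] = " ".join(p.get_text().split())

print(paragraphs_by_label["First"])
```

Looking paragraphs up by label rather than by position should keep working if the page reorders them, though it does depend on the labels staying the same.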

Upvotes: 1

Nikhil Rajawat

Reputation: 121

Once you have all the <p> tags, you can call .getText() on each one and compare the result with the text you want.
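
A rough sketch of that comparison, assuming you know the exact text once its whitespace is normalised. The function name find_paragraph_by_text and the shortened HTML are illustrative only:

```python
from bs4 import BeautifulSoup


def find_paragraph_by_text(soup, wanted):
    """Return the first <p> whose normalised text equals `wanted`, else None."""
    for p in soup.find_all("p"):
        # getText() is an alias of get_text(); collapse whitespace before comparing
        text = " ".join(p.getText().split())
        if text == wanted:
            return p
    return None


# Shortened version of the question's HTML, for illustration
html_doc = (
    "<div><p><strong>First </strong><br>Jun 2017</p>"
    "<p><strong>Last </strong><br>2019</p></div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

match = find_paragraph_by_text(soup, "First Jun 2017")
missing = find_paragraph_by_text(soup, "No such text")
print(match)
```

Normalising the whitespace first matters here, because the raw text of each <p> on the real page contains newlines and indentation.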

Upvotes: 0
