Reputation: 105
The page has multiple <p> tags, but I only want to scrape one of them. The inspected markup looks like this:
<div class="sidebar sbt">
<h4>history</h4>
<p class="top">
<strong>First </strong><br>
Jun 2017
</p>
<p class="top">
<strong>Page </strong><br>
Last 30 days: <strong>200</strong>
</p>
<p class="top">
<strong>Last </strong><br>
2019
</p>
</div>
As shown above, there are several <p> tags. If I wanted to scrape just one of them, e.g. First Jun 2017, how would I do that with the soup.findAll(..) function?
Upvotes: 1
Views: 3357
Reputation: 1560
You can try the following, which uses the soup.findAll(..) function:
from bs4 import BeautifulSoup
html_doc="""
<div class="sidebar sbt">
<h4>history</h4>
<p class="top">
<strong>First </strong><br>
Jun 2017
</p>
<p class="top">
<strong>Page </strong><br>
Last 30 days: <strong>200</strong>
</p>
<p class="top">
<strong>Last </strong><br>
2019
</p>
</div>
"""
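soup = BeautifulSoup(html_doc, 'lxml')
# findAll (the older spelling of find_all) returns a list; index 0 is the first <p>.
result = soup.findAll('p')[0].text
# Collapse the newlines and repeated spaces into single spaces.
print(" ".join(result.split()))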
Output will be:
First Jun 2017
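If only one element is needed, the filtering can also be left to find_all itself. The following is a minimal sketch, assuming the paragraphs of interest all carry class="top" as in the question. It uses html.parser so that it runs without lxml installed, and the names matches and first_text are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="sidebar sbt">
<h4>history</h4>
<p class="top">
<strong>First </strong><br>
Jun 2017
</p>
<p class="top">
<strong>Page </strong><br>
Last 30 days: <strong>200</strong>
</p>
<p class="top">
<strong>Last </strong><br>
2019
</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Restrict the search to <p class="top"> and stop after the first match.
matches = soup.find_all("p", class_="top", limit=1)

# find_all always returns a list, so guard against an empty result.
first_text = " ".join(matches[0].get_text().split()) if matches else None
print(first_text)
```

With limit=1 the list holds at most one element, which avoids collecting every <p> on a large page just to read the first one.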
Upvotes: 2
Reputation: 21643
You seem to want to target p elements according to text. Here's one way of doing that.
The most significant line is the one that uses a regular expression to find 'Last 30 days', which is just part of the string in a p element. Having found this string you can find its parent and then display the entire text of that parent or other chunks of the parent.
Notice that since I've used find_all the result is a list (because there could be more than one item). I needed to choose the first, element zero.
>>> import re
>>> import bs4
>>> HTML = open('temp.htm').read()
>>> for line in HTML.split('\n'):
... print (line)
...
<div class="sidebar sbt">
<h4>history</h4>
<p class="top">
<strong>First </strong><br>
Jun 2017
</p>
<p class="top">
<strong>Page </strong><br>
Last 30 days: <strong>200</strong>
</p>
<p class="top">
<strong>Last </strong><br>
2019
</p>
</div>
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> target = soup.find_all(string=re.compile('Last 30 days'))
>>> target
['\n Last 30 days: ']
>>> target[0].findParent()
<p class="top">
<strong>Page </strong><br/>
Last 30 days: <strong>200</strong>
</p>
>>> target[0].findParent().text
'\nPage \n Last 30 days: 200\n'
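The same text-based approach can be pointed at the paragraph named in the question ('First Jun 2017'). This is a minimal sketch, assuming the label always sits in a <strong> element inside the paragraph. The HTML is inlined instead of being read from temp.htm, html.parser is used so lxml is not required, and the names label and paragraph are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<div class="sidebar sbt">
<h4>history</h4>
<p class="top">
<strong>First </strong><br>
Jun 2017
</p>
<p class="top">
<strong>Page </strong><br>
Last 30 days: <strong>200</strong>
</p>
<p class="top">
<strong>Last </strong><br>
2019
</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Find the <strong> element whose text starts with "First".
label = soup.find("strong", string=re.compile(r"^First"))

# Step up to the enclosing <p>; stay at None if the label was not found.
paragraph = label.find_parent("p") if label else None

# Normalize whitespace so the result prints on one line.
first_text = " ".join(paragraph.get_text().split()) if paragraph else None
print(first_text)
```

Using find instead of find_all returns a single element (or None), so no indexing is needed.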
Upvotes: 0
Reputation: 1135
Use soup.p, which gives you the first <p> element in the given HTML data.
>>> from bs4 import BeautifulSoup
>>> htmlData = '''
... <div class="sidebar sbt">
... <h4>history</h4>
... <p class="top">
... <strong>First </strong><br>
... Jun 2017
... </p>
... <p class="top">
... <strong>Page </strong><br>
... Last 30 days: <strong>200</strong>
... </p>
... <p class="top">
... <strong>Last </strong><br>
... 2019
... </p>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(htmlData, 'html.parser')
>>> soup.p
<p class="top">
<strong>First </strong><br>
Jun 2017
</br></p>
>>>
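If we want the nth <p> element among its siblings, we can use soup.select("p:nth-of-type(n)"), replacing n with the position we want (counting from 1). Note that select returns a list.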
Example:
>>> soup.select("p:nth-of-type(3)")
[<p class="top">
<strong>Last </strong><br>
2019
</br></p>]
>>> soup.select("p:nth-of-type(2)")
[<p class="top">
<strong>Page </strong><br>
Last 30 days: <strong>200</strong>
</br></p>]
>>> soup.select("p:nth-of-type(1)")
[<p class="top">
<strong>First </strong><br>
Jun 2017
</br></p>]
>>>
Alternatively, you can find all the p tags and then iterate over them to find the desired one.
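The iteration idea could look roughly like the sketch below. It assumes each paragraph begins with a <strong> label, as in the question's HTML. The helper find_paragraph is a hypothetical name introduced here for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html_data = """
<div class="sidebar sbt">
<h4>history</h4>
<p class="top">
<strong>First </strong><br>
Jun 2017
</p>
<p class="top">
<strong>Page </strong><br>
Last 30 days: <strong>200</strong>
</p>
<p class="top">
<strong>Last </strong><br>
2019
</p>
</div>
"""

soup = BeautifulSoup(html_data, "html.parser")


def find_paragraph(soup, label):
    """Return the whitespace-normalized text of the first <p> whose
    leading <strong> label equals `label`, or None if nothing matches."""
    for p in soup.find_all("p"):
        strong = p.find("strong")  # first <strong> inside this <p>
        if strong and strong.get_text(strip=True) == label:
            return " ".join(p.get_text().split())
    return None


print(find_paragraph(soup, "First"))
print(find_paragraph(soup, "Page"))
```

Matching on the label rather than on a position keeps the lookup working if the order of the paragraphs changes.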
Upvotes: 1
Reputation: 121
You can use .getText() and compare the result with the text you want, after you have got all the <p> tags.
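A minimal sketch of that comparison follows, assuming the question's HTML and that whitespace differences should be ignored. The helper normalized_text and the variable wanted are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="sidebar sbt">
<h4>history</h4>
<p class="top">
<strong>First </strong><br>
Jun 2017
</p>
<p class="top">
<strong>Page </strong><br>
Last 30 days: <strong>200</strong>
</p>
<p class="top">
<strong>Last </strong><br>
2019
</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def normalized_text(tag):
    """Return the tag's text with runs of whitespace collapsed to one space."""
    return " ".join(tag.getText().split())


wanted = "First Jun 2017"

# Keep only the <p> tags whose normalized text equals the wanted string.
matches = [p for p in soup.find_all("p") if normalized_text(p) == wanted]
print(len(matches))
```

The raw text of each <p> contains newlines from the markup, so comparing without normalizing first would not match.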
Upvotes: 0