Reputation: 431
I have HTML data and I want to get all the text between the <p> tags and put it into dataframes for further processing.
But I only want the text in the <p> tags that are between these tags:
<div class="someclass" itemprop="text">
<p>some text</p>
</div>
Using BeautifulSoup I can get the text between all the <p> tags easily enough. But as I said, I don't want it unless it is between those tags.
Upvotes: 0
Views: 1756
Reputation: 19164
If you want to select a <p> that sits inside a <div> with the class someclass, you can use a CSS selector:
from bs4 import BeautifulSoup

html = '''<div class="someclass" itemprop="text">
<p>some text</p>
<span>not this text</span>
</div>
<div class="someclass" itemprop="text">
<div>not this text</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
p = soup.select_one('div.someclass p') # or select()
print(p.text)
# some text
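The comment above mentions select(); as a minimal sketch, assuming you want every matching paragraph rather than only the first, the selector can also pin down the itemprop attribute and use > so that only direct children of the div are matched. The sample HTML and the name paragraphs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two matching divs and one div that should be ignored
html = '''<div class="someclass" itemprop="text">
<p>first</p>
<span>not this text</span>
</div>
<div class="other">
<p>not this either</p>
</div>
<div class="someclass" itemprop="text">
<p>second</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# '>' limits the match to <p> tags that are direct children of the div
selector = 'div.someclass[itemprop="text"] > p'
paragraphs = [p.get_text(strip=True) for p in soup.select(selector)]
print(paragraphs)
# ['first', 'second']
```

Drop the > if the paragraphs may be nested deeper inside the div.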
Upvotes: 0
Reputation: 985
In case you need a table-specific solution, I would try something like this (daveedwards' answer is better if you don't):
from bs4 import BeautifulSoup

# `browser` is assumed to be an existing Selenium WebDriver with the page already loaded
innerHTML = browser.execute_script("return document.body.innerHTML")
# Pass the string directly; str() on encoded bytes would wrap the markup in b'...'
soup = BeautifulSoup(innerHTML, 'lxml')  # the 'lxml' parser requires the lxml package
# Identify the table that will contain your <div> tags by its class
table = soup.find('table', attrs={'class': 'class_name_of_table_here'})
table_body = table.find('tbody')
divs = table_body.find_all('div', attrs={'class': 'someclass'})
for div in divs:
    print(div.text)
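The same table-scoped idea can be tried without a browser. The following is only a sketch on a static string: the class name results and the sample markup are hypothetical, and html.parser is used so that no extra parser needs to be installed. It also checks for None, since find() returns None when no matching table exists.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one matching div inside the table, one outside it
html = '''<table class="results">
<tbody>
<tr><td><div class="someclass"><p>inside the table</p></div></td></tr>
</tbody>
</table>
<div class="someclass"><p>outside the table</p></div>
'''

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'results'})

texts = []
if table is not None:  # find() returns None when nothing matches
    # Searching from `table` ignores divs elsewhere in the document
    for div in table.find_all('div', attrs={'class': 'someclass'}):
        texts.append(div.get_text(strip=True))
print(texts)
# ['inside the table']
```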
Upvotes: 1
Reputation: 8057
If you want the text in tags that have a specific class, you can tell BeautifulSoup which class to match with the attrs argument of find_all:
html = '''<div class="someclass" itemprop="text">
<p>some text</p>
</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('div', attrs={'class': 'someclass'})
for tag in tags:
    print(tag.text.strip())
output:
some text
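Note that tag.text on the div returns all the text inside it, not only the text of its <p> children. A minimal sketch, assuming you want just the paragraphs and want them in a shape that is easy to load into a dataframe, is to search each matching div for <p> tags and collect the results as a list of dicts. The extra <span> and the names rows and text are illustrative.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, plus a <span> that should be skipped
html = '''<div class="someclass" itemprop="text">
<p>some text</p>
<span>not this text</span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for div in soup.find_all('div', attrs={'class': 'someclass'}):
    # Only <p> tags inside the matching div are collected
    for p in div.find_all('p'):
        rows.append({'text': p.get_text(strip=True)})
print(rows)
# [{'text': 'some text'}]
```

If "dataframes" here means pandas, a list of dicts like rows can be passed straight to pandas.DataFrame(rows).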
Upvotes: 2