Mannix

Reputation: 431

Getting text from HTML with python

I have HTML data and I want to get all the text between the <p> tags and put it into dataframes for further processing.

But I only want the text in the <p> tags that are between these tags:

            <div class="someclass" itemprop="text">
                    <p>some text</p>
            </div>

Using BeautifulSoup I can get the text between all the <p> tags easily enough. But as I said, I only want it when it is between those tags.

Upvotes: 0

Views: 1756

Answers (3)

ewwink

Reputation: 19164

If you want to select a <p> that sits inside a <div> with the class someclass, you can use a CSS selector:

from bs4 import BeautifulSoup

html = '''<div class="someclass" itemprop="text">
            <p>some text</p>
            <span>not this text</span>   
          </div>
          <div class="someclass" itemprop="text">
            <div>not this text</div>   
          </div>
'''

soup = BeautifulSoup(html, 'html.parser')
p = soup.select_one('div.someclass p')  # first match only; use select() to get all matches
print(p.text)
# some text
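
If more than one paragraph can match, `select()` returns every match rather than only the first. Here is a minimal sketch, assuming the sample markup below (the second paragraph and the outer <p> are made up for illustration). Adding `[itemprop="text"]` to the selector also narrows the match to divs carrying that attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one <p> outside the target div, two inside it
html = '''<p>outside, not wanted</p>
<div class="someclass" itemprop="text">
  <p>some text</p>
  <p>more text</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Match <p> elements nested inside <div class="someclass" itemprop="text">
paragraphs = soup.select('div.someclass[itemprop="text"] p')
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
```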

Upvotes: 0

Daniel Scott

Reputation: 985

In case you need a table-specific solution, I would try something like this (daveedwards' answer is better if you don't!):

from bs4 import BeautifulSoup  # the 'lxml' parser needs the lxml package installed

# `browser` is assumed to be an existing Selenium WebDriver instance
inner_html = browser.execute_script("return document.body.innerHTML")
# Pass the string as-is; str() on encoded bytes would yield "b'...'" in Python 3
soup = BeautifulSoup(inner_html, 'lxml')

# Identify the table that will contain your <div> tags by its class
table = soup.find('table', attrs={'class': 'class_name_of_table_here'})
table_body = table.find('tbody')
divs = table_body.find_all('div', attrs={'class': 'someclass'})

# Collect the text of every matching <div>, not just the last one
selected_texts = [div.get_text(strip=True) for div in divs]

print(selected_texts)

Upvotes: 1

If you want the text in tags that have a specific class, with BeautifulSoup you can specify that class with the attrs argument:

html = '''<div class="someclass" itemprop="text">
                    <p>some text</p>
            </div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all('div', attrs={'class': 'someclass'})

for tag in tags:
    print(tag.text.strip())

output:

some text
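
Since the question asks for the text in a dataframe, the matched strings can be handed on to pandas. This is a rough sketch, assuming pandas is installed. The column name `text` and the second div are arbitrary choices for illustration, not something from the question:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup: only the first div has the wanted class
html = '''<div class="someclass" itemprop="text">
  <p>some text</p>
</div>
<div class="otherclass">
  <p>ignored</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# One row per <p> found inside a div with class "someclass"
rows = [
    p.get_text(strip=True)
    for div in soup.find_all('div', attrs={'class': 'someclass'})
    for p in div.find_all('p')
]

df = pd.DataFrame({'text': rows})
print(df)
```

Looping over the <p> tags inside each div, rather than taking the div's text, keeps any other elements in the div (such as a <span>) out of the result.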

Upvotes: 2
