Can I modify the text within a beautiful soup tag without converting it into a string?

Question

I am working on parsing a scraped webpage using BeautifulSoup and, as always, there are weird exceptions to the page's regular formatting.

What I have so far is a table. I have gotten all of the rows into rows and all of the columns into cols (which contains all of the <td>s), and then I get the plain text from each element to use later.

This looks like:

soup = BeautifulSoup(html)
table = soup.find("table", {"class" : "election"})
rows = table.findAll("tr")
data = []

for row in rows:
    cols = row.findAll('td')
    cols = [ele.text.strip() for ele in cols]

    data.append([ele for ele in cols if ele]) # Get rid of empty values

return data

The problem is that sometimes one of the <td>s contains several <li>s, and I want to preserve these by replacing them with \ns. Right now, using the .text attribute of ele strips out all of the tags, including the <li>s.

My question is this: is it possible to use .text in a way which preserves only certain tags? I know I could convert the ele into a string first, but then I can't have beautiful soup automatically remove all of the other ugly tags.

Here is an example of the html where the <td> contains <li>s:

<td> November General Election Day.Scheduled Elections:
    <ul>
        <li>Federal, Statewide, Legislative and Judicial Offices</li>
        <li>County Offices</li>
        <li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
</td>

Right now, my code outputs:

u'November General Election Day.Scheduled Elections:Federal, Statewide, Legislative and Judicial OfficesCounty OfficesInitiatives and Constitutional Amendments, if applicable'

and I would like it to look more like:

u'November General Election Day.Scheduled Elections:
Federal, Statewide, Legislative and Judicial Offices
County Offices
Initiatives and Constitutional Amendments, if applicable'

alecxe · Accepted Answer

I'm still not sure what the motivation behind this question is, but here's the idea.

Find all li tags and insert() a newline character at the beginning of each one's contents.

Working example (I've added some other tags to the td to demonstrate the behavior):

from bs4 import BeautifulSoup

data = """
<td> November General Election Day.Scheduled Elections:
    <span>My Test String</span> 
    <ul>
        <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
    <p>New Paragraph</p>
</td>
"""

soup = BeautifulSoup(data, 'html.parser')
for element in soup.td.find_all('li'):
    element.insert(0, '\n')

print(soup.td.text)

Prints:

November General Election Day.Scheduled Elections:
    My Test String 


Federal, Statewide, Legislative and Judicial Offices
County Offices
Initiatives and Constitutional Amendments, if applicable

New Paragraph
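
The printed text keeps the indentation and blank lines that were in the source markup. If that whitespace is unwanted, one possible cleanup is to strip each line and drop the empty ones. This is not part of the original answer: clean_lines is a hypothetical helper name and the sample string is made up for illustration. A minimal sketch in plain Python:

```python
def clean_lines(text):
    """Strip every line and drop the ones that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Made-up sample imitating the whitespace left behind by the markup.
raw = "Scheduled Elections:\n    My Test String \n\n\nCounty Offices\n\nNew Paragraph\n"
cleaned = clean_lines(raw)
print(cleaned)
```

In the working example it could be called as clean_lines(soup.td.text) after the newlines have been inserted.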

Here is how you can apply the solution in your case:

from bs4 import BeautifulSoup

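html = """
<table class="election">
    <tr>
        <td> November General Election Day.Scheduled Elections:
            <span>My Test String</span> 
            <ul>
                <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
            </ul>
            <p>New Paragraph</p>
        </td>
    </tr>
</table>
"""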

soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table", {"class": "election"})
rows = table.find_all("tr")

data = []
for row in rows:
    # Prepend a newline to every <li> so that .text keeps the items on separate lines
    for element in row.select('td li'):
        element.insert(0, '\n')
    data.append([ele.text.strip() for ele in row('td')])

print(data)
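
For reuse, the same idea can be wrapped in a function that returns the data, as the code in the question does. The following is a sketch under assumptions rather than a definitive implementation: the function name table_rows, its table_class parameter, and the shortened sample markup are all hypothetical, and it assumes Python 3 with bs4 installed.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_class="election"):
    """Return one list of cell texts per <tr>, keeping <li> items on separate lines."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": table_class})
    data = []
    for row in table.find_all("tr"):
        # Prepend a newline to every <li> so .text keeps the list items apart.
        for item in row.select("td li"):
            item.insert(0, "\n")
        cells = [cell.text.strip() for cell in row.find_all("td")]
        data.append([cell for cell in cells if cell])  # drop empty cells
    return data


# Shortened, made-up sample markup for illustration.
sample = """
<table class="election">
  <tr>
    <td>Scheduled Elections:<ul><li>County Offices</li><li>Initiatives</li></ul></td>
  </tr>
</table>
"""

rows = table_rows(sample)
print(rows)
```

The tree is edited in place and only converted to text at the very end, so Beautiful Soup still removes every other tag.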
