Conor Strejcek

Reputation: 392

Can I modify the text within a beautiful soup tag without converting it into a string?

I am working on parsing a scraped webpage using BeautifulSoup and, as always, there are weird exceptions to the page's regular formatting.

So far I locate the table, collect all of its rows into rows, and for each row collect its <td> elements into cols. I then take the plain text of each element to use later.

This looks like:

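from bs4 import BeautifulSoup

def parse_table(html):  # wrapper name is assumed; the snippet was excerpted from a function
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "election"})
    rows = table.find_all("tr")
    data = []

    for row in rows:
        cols = row.find_all("td")
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # Get rid of empty values

    return data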

The problem is that sometimes, one of the <td>s contains several <li>s, and I want to preserve these by replacing them with \ns. Right now, using the .text attribute of ele strips out all of the tags, including the <li>s.

My question is this: is it possible to use .text in a way that preserves only certain tags? I know I could convert ele into a string first, but then BeautifulSoup can no longer strip out all of the other unwanted tags for me.

Here is an example of the html where the <td> contains <li>s:

<td> November General Election Day.Scheduled Elections:
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li>
        <li>County Offices</li>
        <li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
</td>

Right now, my code outputs:

u'November General Election Day.Scheduled Elections:Federal, Statewide, Legislative and Judicial OfficesCounty OfficesInitiatives and Constitutional Amendments, if applicable'

and I would like it to look more like:

u'November General Election Day.Scheduled Elections:\nFederal, Statewide, Legislative and Judicial Offices\nCounty Offices\nInitiatives and Constitutional Amendments, if applicable'

Upvotes: 3

Views: 4506

Answers (1)

alecxe

Reputation: 473873

I'm still not sure what the motivation behind this question is, but here's the idea.

Find all of the li tags and insert() a newline character at the beginning of each one's contents, so that .text emits a line break before every list item.

Working example (I've added some other tags to the td to demonstrate the behavior):

from bs4 import BeautifulSoup

data = """
<td> November General Election Day.Scheduled Elections:
    <b>My Test String </b>
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
    <p>New Paragraph</p>
</td>
"""

soup = BeautifulSoup(data, 'html.parser')
for element in soup.td.find_all('li'):
    element.insert(0, '\n')

print(soup.td.text)

Prints:

November General Election Day.Scheduled Elections:
    My Test String 


Federal, Statewide, Legislative and Judicial Offices
County Offices
Initiatives and Constitutional Amendments, if applicable

New Paragraph
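The text printed above still carries the indentation and blank lines of the source HTML. If the goal is exactly the string asked for in the question, one possible way to tidy it is sketched below. The helper name cell_text is hypothetical, and the whitespace handling is an assumption about what the asker wants, not something BeautifulSoup does on its own. The HTML is the example from the question.

```python
from bs4 import BeautifulSoup


def cell_text(td):
    """Return the text of td with every <li> starting on its own line.

    Note: this modifies the parsed tree in place (hypothetical helper).
    """
    # Same trick as above: put a newline at the start of each <li>.
    for li in td.find_all("li"):
        li.insert(0, "\n")
    # Collapse runs of whitespace inside each line, then drop empty lines.
    lines = (" ".join(line.split()) for line in td.get_text().split("\n"))
    return "\n".join(line for line in lines if line)


html = """<td> November General Election Day.Scheduled Elections:
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li>
        <li>County Offices</li>
        <li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
</td>"""

soup = BeautifulSoup(html, "html.parser")
result = cell_text(soup.td)
print(repr(result))
```

Because each line is normalized separately, the only newlines left in the result are the one after the heading text and the ones inserted before the list items.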

Here is how you can apply the solution in your case:

from bs4 import BeautifulSoup

html = """
<table class="election">
    <tr>
        <td> November General Election Day.Scheduled Elections:
            <b>My Test String </b>
            <ul class="vtips">
                <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
            </ul>
            <p>New Paragraph</p>
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table", {"class": "election"})
rows = table.find_all("tr")

data = []
for row in rows:
    for element in row.select('td li'):
        element.insert(0, '\n')
    data.append([ele.text.strip() for ele in row('td')])

print(data)
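If modifying the tree is undesirable, a different technique (not the insert() approach above) may be enough: get_text() accepts a separator and a strip flag. The sketch below uses shortened, made-up HTML to show the trade-off. The separator goes between every text node, not only between li elements, so inline tags such as <b> also cause a line break.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative HTML (not the asker's real page).
html = """
<table class="election">
    <tr>
        <td>Scheduled Elections:
            <ul class="vtips">
                <li>County Offices</li><li>Initiatives, if <b>applicable</b></li>
            </ul>
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text node and skips whitespace-only ones;
# separator="\n" is placed between the text nodes that remain.
cells = [td.get_text(separator="\n", strip=True) for td in soup.find_all("td")]
print(cells)
```

Here the word inside <b> ends up on its own line, which is why inserting the newline only into the li tags is the safer choice when cells contain other inline markup.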

Upvotes: 2
