Reputation: 392
I am working on parsing a scraped webpage using BeautifulSoup and, as always, there are weird exceptions to the page's regular formatting.
What I have so far is a table. I have collected all of the rows into rows and all of the cells into cols (which contains all of the <td>s), and then I get the plain text from each element to use later.
This looks like:
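from bs4 import BeautifulSoup

# The snippet ends in "return data", so it lives inside a function;
# the name parse_table is only a placeholder for illustration.
def parse_table(html):
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "election"})
    rows = table.findAll("tr")
    data = []
    for row in rows:
        cols = row.findAll('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # Get rid of empty values
    return data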
The problem is that sometimes one of the <td>s contains several <li>s, and I want to preserve these by replacing them with \n characters. Right now, using the .text attribute of ele strips out all of the tags, including the <li>s.
My question is this: is it possible to use .text in a way that preserves only certain tags? I know I could convert ele into a string first, but then I can't have Beautiful Soup automatically remove all of the other ugly tags.
Here is an example of the HTML where the <td> contains <li>s:
<td> November General Election Day.Scheduled Elections:
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li>
<li>County Offices</li>
<li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
</td>
Right now, my code outputs:
u'November General Election Day.Scheduled Elections:Federal, Statewide, Legislative and Judicial OfficesCounty OfficesInitiatives and Constitutional Amendments, if applicable'
and I would like it to look more like:
u'November General Election Day.Scheduled Elections:\nFederal, Statewide, Legislative and Judicial Offices\nCounty Offices\nInitiatives and Constitutional Amendments, if applicable'
Reputation: 473873
I'm still not sure what the motivation behind this question is, but here's the idea: find all li tags and insert() a newline character at the beginning of each one's contents. When .text then strips the tags, every list item starts on its own line.
Working example (I've added some other tags to the td to demonstrate the behavior):
from bs4 import BeautifulSoup
data = """
<td> November General Election Day.Scheduled Elections:
<b>My Test String </b>
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
<p>New Paragraph</p>
</td>
"""
soup = BeautifulSoup(data, 'html.parser')
# Prepend a newline to the contents of every <li> inside the cell
for element in soup.td.find_all('li'):
    element.insert(0, '\n')

print(soup.td.text)
Prints:
November General Election Day.Scheduled Elections:
My Test String
Federal, Statewide, Legislative and Judicial Offices
County Offices
Initiatives and Constitutional Amendments, if applicable
New Paragraph
Here is how you can apply the solution in your case:
from bs4 import BeautifulSoup
html = """
<table class="election">
<tr>
<td> November General Election Day.Scheduled Elections:
<b>My Test String </b>
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
<p>New Paragraph</p>
</td>
</tr>
</table>
"""
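soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table", {"class": "election"})
rows = table.find_all("tr")
data = []
for row in rows:
    # Prepend a newline to every <li> found inside this row's cells
    for element in row.select('td li'):
        element.insert(0, '\n')
    # row('td') is shorthand for row.find_all('td')
    data.append([ele.text.strip() for ele in row('td')])

print(data)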
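If you also want to keep the empty-cell filtering from the question, the two pieces can be folded into one helper. This is only a rough sketch: the function name table_rows_with_li_breaks and its table_class parameter are made up for illustration, and it assumes the table has the same shape as in the question.

```python
from bs4 import BeautifulSoup


def table_rows_with_li_breaks(html, table_class="election"):
    """Return cell text row by row, with each <li> starting on a new line.

    Hypothetical helper: the name and the table_class parameter are
    illustrative and do not come from the original question or answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": table_class})
    data = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        for td in cells:
            # Prepend a newline to the contents of every <li> in this cell.
            for li in td.find_all("li"):
                li.insert(0, "\n")
        texts = [td.get_text().strip() for td in cells]
        # Drop empty cells, as the question's code does.
        data.append([text for text in texts if text])
    return data
```

Keep in mind that .text and get_text() also keep any whitespace that sits between tags in the source, so when each <li> is already on its own line in the HTML (as in the question's sample) you will see extra blank lines that you may want to collapse afterwards.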