Reputation: 394
I want to parse all text blocks (TITLE CONTENT, BODY CONTENT, and EXTRA CONTENT) from the sample below. As you might notice, these text blocks are positioned differently inside each 'p' tag.
<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>
I want to have my final result in table format like:
Col1              Col2             Col3
TITLE CONTENT #1  BODY CONTENT #1  EXTRA CONTENT #1
TITLE CONTENT #2  BODY CONTENT #2  EXTRA CONTENT #2
TITLE CONTENT #3  BODY CONTENT #3  EXTRA CONTENT #3
I've tried:
for i in soup.find_all('p'):
    title = i.find('strong')
    if not isinstance(title.nextSibling, NavigableString):
        body = title.nextSibling.nextSibling
        extra = body.nextSibling.nextSibling
    else:
        if len(title.nextSibling) > 3:
            body = title.nextSibling
            extra = body.nextSibling.nextSibling
        else:
            body = title.nextSibling.nextSibling.nextSibling
            extra = body.nextSibling.nextSibling
But it doesn't look efficient. I'm wondering if anyone has any better solutions?
Any help will be really appreciated!
Thanks!
Upvotes: 0
Views: 872
Reputation: 195543
In this case you can use BeautifulSoup's get_text() method with the separator= parameter:
data = '''<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
for p in soup.select('p.plans'):
    # Split on the separator and drop whitespace-only pieces
    cells = [i.strip() for i in p.get_text(separator='|').split('|') if i.strip()]
    print(''.join('{: ^25}'.format(i) for i in cells))
Prints:
          Col1                     Col2                     Col3
    TITLE CONTENT #1          BODY CONTENT #1         EXTRA CONTENT #1
    TITLE CONTENT #2          BODY CONTENT #2         EXTRA CONTENT #2
    TITLE CONTENT #3          BODY CONTENT #3         EXTRA CONTENT #3
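If the text itself could ever contain the separator character, one variant that avoids the split step is to read stripped_strings from each p.plans element. The following is only a minimal sketch, not part of the answer above; the rows name and the shortened sample are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same two layouts as in the question.
data = '''<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>'''

soup = BeautifulSoup(data, 'html.parser')

# stripped_strings yields each non-empty text node with surrounding
# whitespace removed, so no separator character is needed.
rows = [list(p.stripped_strings) for p in soup.select('p.plans')]

print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
for row in rows:
    print(''.join('{: ^25}'.format(cell) for cell in row))
```

Because each row is built from a single p element, a missing block in one paragraph would shorten only that row.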
Upvotes: 0
Reputation: 746
Another way is to use slicing, assuming every p tag always yields exactly three text blocks:
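from bs4 import BeautifulSoup
# Parse the saved sample; the with block closes the file afterwards
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")
def slicing(l):
    # Split the flat list into consecutive groups of three
    new_list = []
    for i in range(0, len(l), 3):
        new_list.append(l[i:i+3])
    return new_list
result = slicing(list(soup.stripped_strings))
print(result)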
Output:
[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
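Since the slicing approach silently misaligns rows when a block is missing, it may help to check the count before grouping. This is a rough sketch under that assumption; the chunk helper and the inline sample are hypothetical and stand in for slicing() and test.html.

```python
from bs4 import BeautifulSoup


def chunk(items, size):
    """Split items into consecutive groups of `size` (hypothetical helper).

    Raises ValueError when the items do not divide evenly, which suggests
    that some block is missing or duplicated in the page.
    """
    if len(items) % size != 0:
        raise ValueError(
            f"expected a multiple of {size} strings, got {len(items)}"
        )
    return [items[i:i + size] for i in range(0, len(items), size)]


# Inline sample so the sketch runs without a test.html file.
html = """<p class="plans"><strong>TITLE CONTENT #1</strong><br/>BODY CONTENT #1<br/>EXTRA CONTENT #1</p>
<p class="plans"><strong>TITLE CONTENT #2<br/></strong>BODY CONTENT #2<br/>EXTRA CONTENT #2</p>"""

soup = BeautifulSoup(html, "html.parser")
result = chunk(list(soup.stripped_strings), 3)
print(result)
```

The check only catches totals that are not a multiple of three; two paragraphs that are each off by one in opposite directions would still slip through.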
Upvotes: 0
Reputation: 1734
It is important to note that .next_sibling could work as well, but you'd have to use some logic to know how many times to call it, as you may need to gather multiple text nodes. In this example, I find it easier to simply navigate the descendants, noting important characteristics that signal when to do something different.
You simply have to break down the characteristics of what you are scraping. In this simple case, we know:
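- When we see a strong element, we want to capture the "title".
- When we see the first br element, we want to start capturing the "content".
- When we see the second br element, we want to start capturing the "extra content".
We can:
- Select the plans class to get all the plans.
- Iterate through all the descendant nodes of each element matched by plans.
from bs4 import BeautifulSoup as bs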
from bs4 import Tag, NavigableString
html = """
<p class="plans">
<strong>
TITLE CONTENT #1
</strong>
<br/>
BODY CONTENT #1
<br/>
EXTRA CONTENT #1
</p>
<p class="plans">
<strong>
TITLE CONTENT #2
<br/>
</strong>
BODY CONTENT #2
<br/>
EXTRA CONTENT #2
</p>
<p class="plans">
<strong>
TITLE CONTENT #3
</strong>
<br/>
BODY CONTENT #3
<br/>
EXTRA CONTENT #3
</p>
"""
soup = bs(html, 'html.parser')
content = []
# Iterate through all the plans
for plans in soup.select('.plans'):
    # Lists that will hold the text nodes of interest
    title = []
    body = []
    extra = []
    current = None  # Reference to one of the above lists to store data
    br = 0  # Count number of br tags
    # Iterate through all the descendant nodes of a plan
    for node in plans.descendants:
        # See if the node is a Tag/Element
        if isinstance(node, Tag):
            if node.name == 'strong':
                # Strong tags/elements contain our title
                # So set the current container for text to the title list
                current = title
            elif node.name == 'br':
                # We've found a br Tag/Element
                br += 1
                if br == 1:
                    # If this is the first, we need to set the current
                    # container for text to the body list
                    current = body
                elif br == 2:
                    # If this is the second, we need to set the current
                    # container for text to the extra list
                    current = extra
        elif isinstance(node, NavigableString) and current is not None:
            # We've found a navigable string (not a tag/element), so let's
            # store the text node in the current list container.
            # NOTE: You may have to filter out things like HTML comments in a real world example.
            current.append(node)
    # Store the captured title, body, and extra text for the current plan.
    # For each list, join the text into one string and strip leading and trailing whitespace
    # from each entry in the row.
    content.append([''.join(entry).strip() for entry in (title, body, extra)])
print(content)
Then you can print the data any way you want, but you should have it captured in a nice logical way, as shown below:
[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
There are multiple ways to do this; this is just one.
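The NOTE in the code above mentions filtering out HTML comments. A minimal sketch of one way to do that follows, assuming a single made-up plan that contains a comment; the sample markup and the row name are illustrative only. In bs4, Comment is a subclass of NavigableString, so it has to be checked before the general string case.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample: one plan whose title contains an HTML comment.
html = """<p class="plans"><strong>TITLE<!-- internal note --></strong><br/>BODY<br/>EXTRA</p>"""

soup = BeautifulSoup(html, "html.parser")
plan = soup.select_one(".plans")

title, body, extra = [], [], []
current = None  # List that currently receives text nodes
br = 0          # Number of br tags seen so far

for node in plan.descendants:
    if isinstance(node, Tag):
        if node.name == "strong":
            current = title
        elif node.name == "br":
            br += 1
            if br == 1:
                current = body
            elif br == 2:
                current = extra
    elif isinstance(node, Comment):
        # Comment subclasses NavigableString, so test for it first and skip it.
        continue
    elif isinstance(node, NavigableString) and current is not None:
        current.append(node)

row = ["".join(part).strip() for part in (title, body, extra)]
print(row)
```

Without the Comment branch, the comment text would be appended to the title list along with the real text.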
Upvotes: 1