Reputation: 2743
The most common repeating structure in the HTML is:
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
In such situations I grab the text: it is possible for you
Occasionally (i.e., not always), the <p> of class="Standard" has a sibling <p> of class="P3", like so:
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
When this <p> of class="P3" is present, I also want to grab the text inside it. Here, for example, I would additionally grab: (to ask a question in Spanish, you just use inflection)
My question is, given this kind of structure:
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
How can I produce output like this:
it is possible for you
it is acceptable for me
(to ask a question in Spanish, you just use inflection)
Currently, I've managed to do this:
p_standards = soup.find_all("p", class_="Standard")
for p_standard in p_standards:
    p_english = p_standard.find("span", class_="T3")
    print(p_english.contents[0])
And the output I get is:
it is possible for you
it is acceptable for me
Upvotes: 0
Views: 56
Reputation: 84465
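I think it is more efficient to use a CSS selector list (the comma acts as "or") together with the adjacent sibling combinator (+), so that a single select call returns every .Standard element plus any .P3 that immediately follows one: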
from bs4 import BeautifulSoup as bs
html = '''
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
'''
soup = bs(html, 'lxml')
items = [i.text.strip() for i in soup.select('.Standard, .Standard + .P3')]
print(items)
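To illustrate what the adjacent sibling combinator does and does not match, here is a minimal sketch. The markup, the "Other" class, and the helper name grab_texts are made up for the example, and it uses the built-in html.parser instead of lxml only to avoid an extra dependency. A P3 paragraph that does not directly follow a Standard paragraph should be skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second P3 follows a non-Standard <p>,
# so ".Standard + .P3" should not match it.
html = '''
<div>
<p class="Standard"><span class="T3">first phrase</span></p>
<p class="P3">(note for first phrase)</p>
<p class="Other">unrelated</p>
<p class="P3">(orphan note)</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")


def grab_texts(soup):
    """Return the stripped text of every .Standard element and of each
    .P3 element that is the immediate next sibling of a .Standard."""
    selector = ".Standard, .Standard + .P3"
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


texts = grab_texts(soup)
print(texts)
```

Matches come back in document order, so each note stays right after the phrase it belongs to.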
Upvotes: 1
Reputation: 764
Use this:
Python code:
from bs4 import BeautifulSoup
text = '''
<div>
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
</div>
'''
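soup = BeautifulSoup(text, features='html.parser')
p_standards = soup.find_all("p", class_="Standard")
for p_standard in p_standards:
    p_english = p_standard.find('span', attrs={'class': 'T3'})
    # Next sibling tag, or None if this <p> is the last element
    nextSibling = p_standard.find_next_sibling()
    print(p_english.text.strip())
    # class is stored as a list of values, so index 0 is the first class name
    if (nextSibling is not None and nextSibling.name == 'p'
            and (nextSibling.get('class') or [None])[0] == 'P3'):
        print(nextSibling.text.strip())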
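As a sketch of the same idea, the sibling test can be pulled into a small helper that tolerates a missing sibling or a missing class attribute. The helper name is_p3 and the fourth paragraph ("last phrase, no sibling") are made up for the example; the rest of the markup is the question's.

```python
from bs4 import BeautifulSoup

# The question's markup plus one hypothetical trailing paragraph
# that has no following sibling tag.
html = '''
<div>
<p class="Standard"><span class="T3">it is possible for you</span></p>
<p class="Standard"><span class="T3">it is acceptable for me</span></p>
<p class="P3">(to ask a question in Spanish, you just use inflection)</p>
<p class="Standard"><span class="T3">last phrase, no sibling</span></p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")


def is_p3(tag):
    """True if tag is a <p> whose class list contains "P3"."""
    # find_next_sibling() returns None when no sibling tag follows.
    # "class" is multi-valued, so Beautiful Soup stores it as a list;
    # .get() avoids a KeyError when the attribute is absent.
    return tag is not None and tag.name == "p" and "P3" in tag.get("class", [])


lines = []
for p_standard in soup.find_all("p", class_="Standard"):
    lines.append(p_standard.find("span", class_="T3").get_text(strip=True))
    sibling = p_standard.find_next_sibling()
    if is_p3(sibling):
        lines.append(sibling.get_text(strip=True))

print("\n".join(lines))
```

Testing membership with in, rather than comparing index 0, also covers a paragraph that carries more than one class.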
Explanation:
To find the class value on the element returned by find_next_sibling, I had to look through the variables of the instance itself, because I could not find it mentioned in the official documentation. So I printed nextSibling.__dict__.keys().
The 0 index is needed because the class attribute's value is a list.
Upvotes: 1