Reputation:
I have html "page" as follows:
<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Times New Roman","serif"'> </span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>ヤブツバキクラス(常緑広葉樹林)</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Camellietea japonicae</span><span lang=EN-US> Miyawaki <i>et</i>
Ohba 1963<br>
</span></span><span style='font-size:11.0pt'> リュウキュウガキ-クスノハガシワオーダー</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Diospyro maritimae-Mallotetalia philippensis</span><span lang=EN-US>
Fujiwara 1981<br>
</span></span><span style='font-size:11.0pt'> ナガミボチョウジ-リュウキュウガキ群団</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Psychotrio manilensis-Diospyrion maritimae</span><span lang=EN-US>
Niiro <i>et al.</i> 1974<br>
I need to extract as follows:
ヤブツバキクラス(常緑広葉樹林), Camellietea japonicae
リュウキュウガキ-クスノハガシワオーダー, Diospyro maritimae-Mallotetalia philippensis
ナガミボチョウジ-リュウキュウガキ群団, Psychotrio manilensis-Diospyrion maritimae
I tried as:
soup = BeautifulSoup(page, features="lxml")
rows = soup.find_all('span')
for row in rows:
print (row.text.strip().split(' ')[0])
But, it extracted as follows:
ヤブツバキクラス(常緑広葉樹林)
Camellietea
Camellietea
Miyawaki
リュウキュウガキ−クスノハガシワオーダー
Diospyro
Diospyro
Fujiwara
ナガミボチョウジ−リュウキュウガキ群団
Psychotrio
Psychotrio
Niiro
Upvotes: 0
Views: 61
Reputation: 84465
You can also use :first-child and attribute=value selectors with bs4 4.7.1
from bs4 import BeautifulSoup as bs
html = '''
<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Times New Roman","serif"'> </span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>ヤブツバキクラス(常緑広葉樹林)</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Camellietea japonicae</span><span lang=EN-US> Miyawaki <i>et</i>
Ohba 1963<br>
</span></span><span style='font-size:11.0pt'> リュウキュウガキ-クスノハガシワオーダー</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Diospyro maritimae-Mallotetalia philippensis</span><span lang=EN-US>
Fujiwara 1981<br>
</span></span><span style='font-size:11.0pt'> ナガミボチョウジ-リュウキュウガキ群団</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Psychotrio manilensis-Diospyrion maritimae</span><span lang=EN-US>
Niiro <i>et al.</i> 1974<br>
'''
soup = bs(html, 'lxml')
print([i.text.strip() for i in item.select('span[style="font-size:11.0pt"], [lang=EN-US]:first-child')])
Upvotes: 0
Reputation: 57105
Step through the results and take the first two of every four spans:
for i in range(1, len(rows), 4):
print(rows[i].string.strip(),
list(rows[i+1].children)[1].string.strip())
#ヤブツバキクラス(常緑広葉樹林)Camellietea japonicae
#リュウキュウガキ-クスノハガシワオーダー Diospyro maritimae-Mallotetalia philippensis
#ナガミボチョウジ-リュウキュウガキ群団 Psychotrio manilensis-Diospyrion maritimae
Upvotes: 1