Reputation: 7653
In the attached HTML screenshot, I want to get the summary text in the 'lemma-summary' section. It's usually the first sentence of the entry. This is a Chinese Wikipedia-style entry (Baidu Baike). I tried this code with BeautifulSoup:
summaries = doc.getElements('div', attr='label-module', value='para').text
But this returns every text section of the HTML page instead of only the 'lemma-summary' section. If I do this:
summary = soup.select(".lemma-summary")
This does give the right section (only the summary section), but it returns a ResultSet object, and I don't know how to get down to the actual text.
How do I extract the text from this tag?
The URL of the page is here:
https://baike.baidu.com/item/tt%E8%AF%AD%E9%9F%B3
I want to extract this summary text:
"ika是深圳缇卡基因美容生物科技有限公司的一个化妆品品牌。"
Upvotes: 0
Views: 90
Reputation: 5648
I had to use Selenium to get the page to load. If you can get the right HTML without Selenium, that works too.
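import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Run Chrome without opening a browser window
chrome_options = Options()
chrome_options.add_argument("--headless")
# Selenium 4.6+ locates chromedriver on its own; older versions took the driver path as the first argument
driver = webdriver.Chrome(options=chrome_options)
url = 'https://baike.baidu.com/item/tt%E8%AF%AD%E9%9F%B3'
driver.get(url)
time.sleep(5)  # crude wait so the page's JavaScript can finish rendering
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, "html.parser")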
This
soup.find('div', attrs={'class': 'para', 'label-module': 'para'}).text
gets you
'TT语音App,提供游戏组队开黑、职业电竞培养、达人娱乐互动等游戏社交场景。\n[1]\xa0\n'
and this
summary = soup.select(".lemma-summary")
for s in summary:
    print(s.text)
gets you
TT语音App,提供游戏组队开黑、职业电竞培养、达人娱乐互动等游戏社交场景。
[1]
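If you only want the one summary string rather than a ResultSet, something along these lines may help. It is a minimal sketch, not tested against the live page: SAMPLE_HTML is a made-up, simplified stand-in for the real markup, extract_summary is a hypothetical helper name, and the assumption that the "[1]" footnote marker lives in a sup tag is a guess you should check against the actual page source.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup (NOT the real page source).
SAMPLE_HTML = """
<div class="lemma-summary" label-module="lemmaSummary">
  <div class="para" label-module="para">TT语音App,提供游戏社交场景。<sup>[1]</sup>&nbsp;</div>
</div>
"""


def extract_summary(html):
    """Return the text of the first .lemma-summary element, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns a single Tag (or None), unlike select, which returns a ResultSet
    node = soup.select_one(".lemma-summary")
    if node is None:
        return None
    # Assumption: footnote markers such as "[1]" sit inside <sup> tags; drop them
    for sup in node.find_all("sup"):
        sup.decompose()
    # strip=True trims each text fragment and skips whitespace-only ones
    return node.get_text(strip=True)


summary_text = extract_summary(SAMPLE_HTML)
print(summary_text)
```

With the Selenium code above, you would call extract_summary(html) on driver.page_source instead of the sample string. Checking for None first avoids an AttributeError when the page did not render the summary section.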
Upvotes: 1