marlon

Reputation: 7653

How to extract a text summary from a wikipedia term entry in html tags?


In the attached HTML screenshot, I want to get the text summary in the 'lemma-summary' section. It is usually the first sentence of the entry. This is a Chinese encyclopedia entry (Baidu Baike, per the URL below). I tried this code with BeautifulSoup:

summaries = doc.getElements('div', attr='label-module', value='para').text 

But this returns every text section of the HTML page instead of only the 'lemma-summary' section. If I do this instead:

summary = soup.select(".lemma-summary")

This does give the right section (only the summary section), but it returns a ResultSet object, and I don't know how to get down to the actual text.

How do I extract the text from this tag?

The URL of the page is here:

https://baike.baidu.com/item/tt%E8%AF%AD%E9%9F%B3

I want to extract this summary text:

"ika是深圳缇卡基因美容生物科技有限公司的一个化妆品品牌。"

Upvotes: 0

Views: 90

Answers (1)

Jonathan Leon

Reputation: 5648

I had to use Selenium to get the page to load. If you can get the right HTML without Selenium, that works too.

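import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Run Chrome without opening a browser window.
chrome_options = Options()
chrome_options.add_argument("--headless")

# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service('chromedriver.exe'), options=chrome_options)
url = 'https://baike.baidu.com/item/tt%E8%AF%AD%E9%9F%B3'
driver.get(url)
time.sleep(5)  # crude fixed wait so the page can finish rendering
html = driver.page_source
driver.quit()  # release the browser once the HTML is captured
soup = BeautifulSoup(html, "html.parser")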

This

soup.find('div', attrs={'class': 'para', 'label-module': 'para'}).text

gets you

'TT语音App,提供游戏组队开黑、职业电竞培养、达人娱乐互动等游戏社交场景。\n[1]\xa0\n'

and this

summary = soup.select(".lemma-summary")
for s in summary:
    print(s.text)

gets you

TT语音App,提供游戏组队开黑、职业电竞培养、达人娱乐互动等游戏社交场景。
[1]  
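
If only the sentence itself is wanted, without the trailing "[1]" marker and stray whitespace, something along these lines may help. This is a minimal sketch, not a definitive solution: the inline HTML is hypothetical markup that merely imitates the structure described in the question, and summary_text is an illustrative helper name. It uses select_one, which returns the first matching tag (or None) rather than a ResultSet, and then strips footnote markers with a regular expression.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup imitating the page structure from the question;
# the real page may differ.
html = """
<div class="lemma-summary">
  <div class="para">Example summary sentence.
<sup>[1]</sup>&nbsp;</div>
</div>
<div class="para">Body paragraph that should be ignored.</div>
"""


def summary_text(soup):
    """Return the cleaned text of the first .lemma-summary element, or None."""
    node = soup.select_one(".lemma-summary")  # a single Tag, not a ResultSet
    if node is None:
        return None
    raw = node.get_text()
    # Drop footnote markers such as [1], then collapse all whitespace
    # (str.split also treats the non-breaking space \xa0 as whitespace).
    without_refs = re.sub(r"\[\d+\]", "", raw)
    return " ".join(without_refs.split())


soup = BeautifulSoup(html, "html.parser")
print(summary_text(soup))
```

With the soup built from the Selenium page source above, calling summary_text(soup) should behave the same way, provided the page still uses the lemma-summary class.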

Upvotes: 1
