Reputation: 99
I am learning web crawling with Python 3. I want to extract only the text from HTML.
For example, given this HTML:
<div class='titleArea'>
"~~~~~ text~~~~"
</div>
So I wrote this code to extract the text:
title_temp = soup.findAll('div',class_='titleArea')
print(title_temp)
(I know I could use print(title_temp[0].text), but that is not the issue here.)
The result is:
[<div class='titleArea'>
@#$!$^!@#!@^#!$^!@#!@#!@#
</div>]
[<div class='titleArea'>
@#$!$^!@#!@^#!$^!@#!@#!@#
</div>]
(The list appears twice only because the output is repeated.)
I don't want that garbled text.
What should I do?
I think it's a UTF-8 problem, right?
So I added this line:
# -*- coding: utf-8 -*-
but it had no effect.
Upvotes: 2
Views: 145
Reputation: 12168
import requests, bs4
r = requests.get('http://hri.co.kr/board/reportView.asp?firstDepth=1&secondDepth=1&numIdx=26865')
# The page is encoded as EUC-KR, so tell requests to decode it that way.
r.encoding = 'euc-kr'
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.find_all('div', class_='titleArea')
Output:
[<div class="titleArea">
트럼프노믹스가 중국 경제에 미치는 영향
</div>]
The charset (euc-kr for this page) is declared in the HTML head tag.
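To illustrate why the encoding matters, here is a minimal offline sketch using only the standard library. The HTML string and its meta tag are hypothetical stand-ins for the real page (the exact tag on that page may differ), and the regular expression is a rough way to find a declared charset, not a full HTML parser. It assumes the garbled output came from decoding EUC-KR bytes with a different codec, such as ISO-8859-1.

```python
import re

# Hypothetical page body; the real page's markup may differ.
title = "트럼프노믹스가 중국 경제에 미치는 영향"
html = (
    '<html><head><meta charset="euc-kr"></head>'
    f'<body><div class="titleArea">{title}</div></body></html>'
)

# Bytes as a server using EUC-KR would send them.
raw = html.encode("euc-kr")

# Decoding with the wrong codec does not fail, it just yields garbage.
garbled = raw.decode("iso-8859-1")

# Look for a declared charset in the raw bytes (rough heuristic).
match = re.search(rb'charset=["\']?([\w-]+)', raw, re.IGNORECASE)
detected = match.group(1).decode("ascii").lower() if match else "utf-8"

# Decoding with the declared charset recovers the original text.
correct = raw.decode(detected)

print(detected)
print(title in garbled)
print(title in correct)
```

Note that a `# -*- coding: utf-8 -*-` line only declares the encoding of the Python source file itself; it does not change how downloaded bytes are decoded, which is why it had no effect here.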
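EDIT: A more elegant way:
import requests, bs4
r = requests.get('http://hri.co.kr/board/reportView.asp?firstDepth=1&secondDepth=1&numIdx=26865')
# apparent_encoding is the encoding requests detects from the response body.
r.encoding = r.apparent_encoding
This sets the encoding automatically, so you do not have to look up the charset yourself.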
Upvotes: 4