Reputation: 73
index_cd = 'KPI200'
page_n = 1
naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code' + index_cd + '&page=' + str(page_n)
from urllib.request import urlopen
source = urlopen(naver_index).read()
import bs4
source = bs4.BeautifulSoup(source, 'lxml')
td = source.find_all('td')
len(td)
# /html/body/div/table[1]/tbody/tr[3]/td[1] # this is XPath
source.find_all('table')[0].find_all('tr')[2].find_all('td')[0]
I expected the output to be: <td class="date">2020.09.29</td>
But the actual result is: <td class="date"> </td>
There is only a '\xa0' (a non-breaking space) between <td class="date"> and </td>.
I need to extract that date. How can I solve this?
Upvotes: 0
Views: 43
Reputation: 5531
The problem is with the URL you provided: you missed an = after code, so the server never receives the index code and returns a table with empty cells.
Change naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code' + index_cd + '&page=' + str(page_n)
to naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code=' + index_cd + '&page=' + str(page_n)
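One way to avoid this kind of typo is to let the standard library build the query string instead of concatenating it by hand. The following is only a sketch: the helper name build_index_url and the constant BASE_URL are made up for illustration and are not part of the original code.

```python
from urllib.parse import urlencode

# Base address taken from the question; BASE_URL is a hypothetical name.
BASE_URL = 'http://finance.naver.com/sise/sise_index_day.nhn'

def build_index_url(index_cd, page_n):
    """Build the page URL, letting urlencode insert the '=' and '&' separators."""
    query = urlencode({'code': index_cd, 'page': page_n})
    return BASE_URL + '?' + query

naver_index = build_index_url('KPI200', 1)
print(naver_index)
```

The resulting string can be passed to urlopen exactly as before.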
This is the working code:
index_cd = 'KPI200'
page_n = 1
naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code=' + index_cd + '&page=' + str(page_n)
from urllib.request import urlopen
source = urlopen(naver_index).read()
import bs4
source = bs4.BeautifulSoup(source, 'lxml')
td = source.find_all('td')
len(td)
# /html/body/div/table[1]/tbody/tr[3]/td[1] # this is XPath
print(source.find_all('table')[0].find_all('tr')[2].find_all('td')[0])
Output:
<td class="date">2020.09.29</td>
If you only want the date itself, without the surrounding tag, change the last line to:
print(source.find_all('table')[0].find_all('tr')[2].find_all('td')[0].text)
Output:
2020.09.29
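If a request ever goes wrong again, the cell text will come back as '\xa0' rather than a date, so it can be worth checking for that before using the value. A minimal sketch, assuming the text has already been pulled out with .text; the helper names clean_cell_text and is_blank_cell are hypothetical:

```python
def clean_cell_text(raw):
    """Replace non-breaking spaces with normal spaces and trim the ends."""
    return raw.replace('\xa0', ' ').strip()

def is_blank_cell(raw):
    """Return True when the cell holds nothing but whitespace or '\\xa0'."""
    return clean_cell_text(raw) == ''

print(is_blank_cell('\xa0'))            # a cell like the one in the question
print(clean_cell_text(' 2020.09.29\xa0'))
```

You would call these on the .text of the td element, for example is_blank_cell(td[0].text).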
Hope this helps!
Upvotes: 1