PhilliP

Reputation: 73

Trying to scrape a page, but the expected value is missing

index_cd = 'KPI200'
page_n = 1
naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code' + index_cd + '&page=' + str(page_n)

from urllib.request import urlopen
source = urlopen(naver_index).read()
import bs4
source = bs4.BeautifulSoup(source, 'lxml')
td = source.find_all('td')
len(td)
# /html/body/div/table[1]/tbody/tr[3]/td[1]  # this is XPath
source.find_all('table')[0].find_all('tr')[2].find_all('td')[0]

I expected the output to be: <td class="date">2020.09.29</td>

But the actual result is: <td class="date"> </td>

There is a '\xa0' between <td class="date"> and </td>.

I need to extract that date. How can I solve this?

Upvotes: 0

Views: 43

Answers (1)

Sushil

Reputation: 5531

The problem is with the URL you provided: you missed an = after code. Without it, the query string becomes ?codeKPI200&page=1, so the code parameter is never set.

Change

naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code' + index_cd + '&page=' + str(page_n)

to

naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code=' + index_cd + '&page=' + str(page_n)
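
One way to make this kind of typo harder to repeat is to let the standard library assemble the query string instead of concatenating it by hand. The following is only a sketch: the helper name build_index_url is hypothetical and not part of the original code, and it assumes the page needs only the code and page parameters shown above.

```python
from urllib.parse import urlencode


def build_index_url(index_cd, page_n):
    """Build the daily index URL (hypothetical helper, not from the original post)."""
    base = 'http://finance.naver.com/sise/sise_index_day.nhn'
    # urlencode inserts the '=' and '&' separators itself,
    # so a missing '=' cannot happen here.
    query = urlencode({'code': index_cd, 'page': page_n})
    return base + '?' + query


naver_index = build_index_url('KPI200', 1)
print(naver_index)
```

The resulting naver_index can be passed to urlopen exactly as before.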

This is the working code:

index_cd = 'KPI200'
page_n = 1
naver_index = 'http://finance.naver.com/sise/sise_index_day.nhn?code=' + index_cd + '&page=' + str(page_n)

from urllib.request import urlopen
source = urlopen(naver_index).read()
import bs4
source = bs4.BeautifulSoup(source, 'lxml')
td = source.find_all('td')
len(td)
# /html/body/div/table[1]/tbody/tr[3]/td[1]  # this is XPath
print(source.find_all('table')[0].find_all('tr')[2].find_all('td')[0])

Output:

<td class="date">2020.09.29</td>

If you only want the date to be displayed, change the last line to:

print(source.find_all('table')[0].find_all('tr')[2].find_all('td')[0].text)

Output:

2020.09.29
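
If the table also contains spacer cells holding only '\xa0', indexing a fixed row can be fragile. As a rough sketch, the date cells can be selected by their class and the blank ones skipped. The sample_html string and the extract_dates name below are made up for illustration (the second date is invented), and 'html.parser' is used here only so the sketch does not depend on lxml.

```python
import bs4

# Made-up sample markup standing in for the real page; the first row
# mimics a spacer row whose cell holds only a non-breaking space.
sample_html = '''
<table>
  <tr><td class="date">\xa0</td></tr>
  <tr><td class="date">2020.09.29</td></tr>
  <tr><td class="date">2020.09.28</td></tr>
</table>
'''


def extract_dates(html):
    """Return the text of every non-blank <td class="date"> cell."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    dates = []
    for cell in soup.find_all('td', class_='date'):
        # strip=True removes surrounding whitespace, including '\xa0'
        text = cell.get_text(strip=True)
        if text:
            dates.append(text)
    return dates


print(extract_dates(sample_html))
```

With the real page, the same function could be given the bytes returned by urlopen(naver_index).read().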

Hope this helps!

Upvotes: 1
