Reputation: 20
I'm trying to extract some climate data from a table using pandas.read_html(), but it returns entire rows as empty. I suspect the webmaster is doing something to discourage web scraping, but I might be wrong. I also tried bs4 and got the same results.
pandas:
import pandas as pd
dfs = pd.read_html('https://www.tutiempo.net/clima/03-2000/ws-879380.html',match='.+', flavor='bs4')
df = dfs[2]
df
output
Día T TM Tm SLP H PP VV V VM VG RA SN TS FG
0 1 9.9 15 6 1007.4 55 0.76 16.9 11.1 18.3 - NaN NaN NaN NaN
1 2 13.5 19 8.4 1006.9 45 0 17.9 13.3 24.1 51.9 NaN NaN NaN NaN
2 3 9.6 18.9 7 1004.8 77 0.76 16.4 17.4 37 50 o NaN NaN NaN
3 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o NaN NaN NaN
4 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 7 16.6 21 12.6 1000.0 67 0 16.9 20 64.8 85.2 NaN NaN NaN NaN
7 8 12.9 21.2 7.8 1001.7 74 - 16.6 19.1 44.3 72.2 o NaN NaN NaN
8 9 11.3 19 8.4 1005.4 83 1.02 15.9 12 29.4 - o NaN NaN NaN
9 10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o NaN NaN NaN
11 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o NaN NaN NaN
12 13 7.5 12 4 1007.5 85 0.25 17.9 9.4 22.2 - NaN NaN NaN NaN
13 14 7.8 12 4.8 995.4 91 0 15.1 16.5 27.8 - o NaN NaN NaN
14 15 6.5 8 5 984.9 79 2.03 16.6 38.2 48.2 63 NaN NaN NaN NaN
bs4:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://www.tutiempo.net/clima/01-2000/ws-879380.html').read()
soup = bs.BeautifulSoup(sauce,'lxml')
table = soup.find("table", {"class": "medias mensuales numspan"})
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
output
['1', '8.6', '13.3', '4.8', '996.5', '64', '0', '18.3', '39.6', '59.1', '-', '\xa0', '\xa0', '\xa0', '\xa0']
['2', '9.4', '13.8', '5.8', '999.4', '69', '0.76', '20.6', '17', '55.4', '-', 'o', '\xa0', '\xa0', '\xa0']
['3', '8', '12.4', '6', '1001.1', '79', '1.27', '18.3', '21.1', '31.3', '-', 'o', '\xa0', '\xa0', '\xa0']
['4', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['5', '', '', '', '', '', '', '', '', '', '', '\xa0', '\xa0', '\xa0', '\xa0']
['6', '', '', '', '', '', '', '', '', '', '', '\xa0', '\xa0', '\xa0', '\xa0']
['7', '8.3', '16.8', '4', '984.2', '64', '5.08', '20.9', '24.8', '74.1', '-', 'o', '\xa0', '\xa0', '\xa0']
['8', '7.3', '13.2', '3.5', '986.3', '65', '0.51', '15', '32.6', '55.4', '-', 'o', '\xa0', '\xa0', '\xa0']
['9', '4.4', '12.4', '0.6', '988.4', '81', '4.06', '14.3', '28.2', '51.9', '-', 'o', 'o', '\xa0', '\xa0']
['10', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['11', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['12', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['13', '8.8', '10.3', '6', '1001.9', '78', '0.25', '18', '57', '70.2', '-', '\xa0', '\xa0', '\xa0', '\xa0']
['14', '9.3', '11', '7.8', '1003.8', '76', '0', '18.3', '58.2', '64.8', '-', '\xa0', '\xa0', '\xa0', '\xa0']
If you check the website, the rows are complete. Any help is appreciated.
Best regards, Sir Ernest Shackleton
Upvotes: 0
Views: 431
Reputation: 1794
They are using <span> tags with attached styles. The attached styles have a content attribute that is used to build the value in cells that appear empty to bs4. The data is all there in the HTML, but you will need to process the styles to get it.
A quick and dirty fix would be to assume the styles don't change and write a pre-processing replacement on the raw HTML before parsing, something like:
html = html.replace('<span class="ntlm">', '1')
or
html = html.replace('<span class="ntzb">', '5')
(Avoid naming the variable str, since that shadows the built-in.)
A better solution would be to process the styles with either a CSS engine or a regex, rebuild the class-to-character map each time you load the page, and then apply the mappings to substitute the text.
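As a sketch of that approach, the snippet below runs against a small inline HTML sample that mimics the pattern described above (the real class names and style rules on tutiempo.net may differ, so treat the sample markup and the regex as assumptions to adapt): it scrapes the `:before { content: "..." }` rules out of the page's styles, builds the class-to-digit map, and replaces each styled span with the character its rule injects.

```python
import re
from bs4 import BeautifulSoup

# Illustrative HTML mimicking the obfuscation pattern; the actual
# class names and style layout on the real page may differ.
html = """
<style>
.ntlm:before { content: "1"; }
.ntzb:before { content: "5"; }
</style>
<table><tr>
  <td><span class="ntlm"></span><span class="ntzb"></span>.2</td>
</tr></table>
"""

# Build the class -> character map from the style rules each time the
# page is loaded, so the fix survives a reshuffle of the class names.
mapping = dict(re.findall(r'\.(\w+):before\s*{\s*content:\s*"([^"]*)"', html))

soup = BeautifulSoup(html, "html.parser")

# Substitute each styled <span> with the character its CSS rule injects.
for span in soup.find_all("span", class_=True):
    cls = span["class"][0]
    if cls in mapping:
        span.replace_with(mapping[cls])

cell = soup.find("td").get_text()
print(cell)  # "15.2"
```

The repaired soup (or `str(soup)`) can then be fed to pandas.read_html, and the previously empty cells come back with their values filled in.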
Upvotes: 1
Reputation: 89
The issue is with how the data is displayed on the website. If you inspect the element, you can see that some of the data you want is stored alongside other data. I'm not sure I've explained this correctly, but I think it's best if you take a look at it for yourself.
Upvotes: 0