shackleton
shackleton

Reputation: 20

SCRAPING - Pandas read_html and bs4 returns mutiple empty rows

I'm trying to extract some climate data from table using pandas.read_html() but it returns entire rows empty. I think it has to do with some desire from the webmaster to prevent webscraping but I might be wrong. I also tried using bs4, but had the same results.

pandas:

import pandas as pd

dfs = pd.read_html('https://www.tutiempo.net/clima/03-2000/ws-879380.html',match='.+', flavor='bs4')

df = dfs[2]
df

output

    Día T   TM  Tm  SLP H   PP  VV  V   VM  VG  RA  SN  TS  FG
0   1   9.9 15  6   1007.4  55  0.76    16.9    11.1    18.3    -   NaN NaN NaN NaN
1   2   13.5    19  8.4 1006.9  45  0   17.9    13.3    24.1    51.9    NaN NaN NaN NaN
2   3   9.6 18.9    7   1004.8  77  0.76    16.4    17.4    37  50  o   NaN NaN NaN
3   4   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
4   5   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5   6   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6   7   16.6    21  12.6    1000.0  67  0   16.9    20  64.8    85.2    NaN NaN NaN NaN
7   8   12.9    21.2    7.8 1001.7  74  -   16.6    19.1    44.3    72.2    o   NaN NaN NaN
8   9   11.3    19  8.4 1005.4  83  1.02    15.9    12  29.4    -   o   NaN NaN NaN
9   10  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10  11  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
11  12  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
12  13  7.5 12  4   1007.5  85  0.25    17.9    9.4 22.2    -   NaN NaN NaN NaN
13  14  7.8 12  4.8 995.4   91  0   15.1    16.5    27.8    -   o   NaN NaN NaN
14  15  6.5 8   5   984.9   79  2.03    16.6    38.2    48.2    63  NaN NaN NaN NaN

bs4:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.tutiempo.net/clima/01-2000/ws-879380.html').read()
soup = bs.BeautifulSoup(sauce,'lxml')

table = soup.find("table", {"class": "medias mensuales numspan"})

table_rows = table.find_all('tr')

for tr in table_rows:
  td = tr.find_all('td') 
  row = [i.text for i in td]
  print(row)

output

['1', '8.6', '13.3', '4.8', '996.5', '64', '0', '18.3', '39.6', '59.1', '-', '\xa0', '\xa0', '\xa0', '\xa0']
['2', '9.4', '13.8', '5.8', '999.4', '69', '0.76', '20.6', '17', '55.4', '-', 'o', '\xa0', '\xa0', '\xa0']
['3', '8', '12.4', '6', '1001.1', '79', '1.27', '18.3', '21.1', '31.3', '-', 'o', '\xa0', '\xa0', '\xa0']
['4', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['5', '', '', '', '', '', '', '', '', '', '', '\xa0', '\xa0', '\xa0', '\xa0']
['6', '', '', '', '', '', '', '', '', '', '', '\xa0', '\xa0', '\xa0', '\xa0']
['7', '8.3', '16.8', '4', '984.2', '64', '5.08', '20.9', '24.8', '74.1', '-', 'o', '\xa0', '\xa0', '\xa0']
['8', '7.3', '13.2', '3.5', '986.3', '65', '0.51', '15', '32.6', '55.4', '-', 'o', '\xa0', '\xa0', '\xa0']
['9', '4.4', '12.4', '0.6', '988.4', '81', '4.06', '14.3', '28.2', '51.9', '-', 'o', 'o', '\xa0', '\xa0']
['10', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['11', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['12', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['13', '8.8', '10.3', '6', '1001.9', '78', '0.25', '18', '57', '70.2', '-', '\xa0', '\xa0', '\xa0', '\xa0']
['14', '9.3', '11', '7.8', '1003.8', '76', '0', '18.3', '58.2', '64.8', '-', '\xa0', '\xa0', '\xa0', '\xa0']

if you check the website, the rows are complete. Anything helps.

Best regards, Sir Ernest Shackleton

Upvotes: 0

Views: 431

Answers (2)

kerasbaz
kerasbaz

Reputation: 1794

They are using <span> tags with attached styles. The attached styles have a content attribute they are using to build the value in cells that appear empty to bs4.

The data is all there in the HTML, but you will need to process the styles to get it:

enter image description here

A quick and dirty fix would be to assume the styles don't change and write a pre-process replacement something like:

str = str.replace('<span class="ntlm">', '1') or str = str.replace('<span class="ntzb">', '5')

A better solution would be to process the styles with either a css engine or regex, build the map each time you load the page, and then apply the mappings to substitute the text.

Upvotes: 1

Rithwik Babu
Rithwik Babu

Reputation: 89

The issue is with how the data is displayed on the website. If you inspect its element you can see that some of the data you want are stored with other data. I'm not sure if I explained this correctly but I think its best if you take a look at it for yourself.

Upvotes: 0

Related Questions