SCRAPING - Pandas read_html and bs4 return multiple empty rows

Question

I'm trying to extract some climate data from a table using pandas.read_html(), but some rows come back entirely empty. I suspect the webmaster is trying to prevent web scraping, but I might be wrong. I also tried bs4 and got the same result.

pandas:

import pandas as pd

dfs = pd.read_html('https://www.tutiempo.net/clima/03-2000/ws-879380.html', match='.+', flavor='bs4')

df = dfs[2]
df

output

    Día T   TM  Tm  SLP H   PP  VV  V   VM  VG  RA  SN  TS  FG
0   1   9.9 15  6   1007.4  55  0.76    16.9    11.1    18.3    -   NaN NaN NaN NaN
1   2   13.5    19  8.4 1006.9  45  0   17.9    13.3    24.1    51.9    NaN NaN NaN NaN
2   3   9.6 18.9    7   1004.8  77  0.76    16.4    17.4    37  50  o   NaN NaN NaN
3   4   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
4   5   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5   6   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6   7   16.6    21  12.6    1000.0  67  0   16.9    20  64.8    85.2    NaN NaN NaN NaN
7   8   12.9    21.2    7.8 1001.7  74  -   16.6    19.1    44.3    72.2    o   NaN NaN NaN
8   9   11.3    19  8.4 1005.4  83  1.02    15.9    12  29.4    -   o   NaN NaN NaN
9   10  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10  11  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
11  12  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN o   NaN NaN NaN
12  13  7.5 12  4   1007.5  85  0.25    17.9    9.4 22.2    -   NaN NaN NaN NaN
13  14  7.8 12  4.8 995.4   91  0   15.1    16.5    27.8    -   o   NaN NaN NaN
14  15  6.5 8   5   984.9   79  2.03    16.6    38.2    48.2    63  NaN NaN NaN NaN

bs4:

import bs4 as bs
import urllib.request

# Download the raw HTML of the monthly climate page
sauce = urllib.request.urlopen('https://www.tutiempo.net/clima/01-2000/ws-879380.html').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# Locate the data table by its class attribute
table = soup.find("table", {"class": "medias mensuales numspan"})

table_rows = table.find_all('tr')

# Print the text of every cell, one list per row
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

output

['1', '8.6', '13.3', '4.8', '996.5', '64', '0', '18.3', '39.6', '59.1', '-', '\xa0', '\xa0', '\xa0', '\xa0']
['2', '9.4', '13.8', '5.8', '999.4', '69', '0.76', '20.6', '17', '55.4', '-', 'o', '\xa0', '\xa0', '\xa0']
['3', '8', '12.4', '6', '1001.1', '79', '1.27', '18.3', '21.1', '31.3', '-', 'o', '\xa0', '\xa0', '\xa0']
['4', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['5', '', '', '', '', '', '', '', '', '', '', '\xa0', '\xa0', '\xa0', '\xa0']
['6', '', '', '', '', '', '', '', '', '', '', '\xa0', '\xa0', '\xa0', '\xa0']
['7', '8.3', '16.8', '4', '984.2', '64', '5.08', '20.9', '24.8', '74.1', '-', 'o', '\xa0', '\xa0', '\xa0']
['8', '7.3', '13.2', '3.5', '986.3', '65', '0.51', '15', '32.6', '55.4', '-', 'o', '\xa0', '\xa0', '\xa0']
['9', '4.4', '12.4', '0.6', '988.4', '81', '4.06', '14.3', '28.2', '51.9', '-', 'o', 'o', '\xa0', '\xa0']
['10', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['11', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['12', '', '', '', '', '', '', '', '', '', '', 'o', '\xa0', '\xa0', '\xa0']
['13', '8.8', '10.3', '6', '1001.9', '78', '0.25', '18', '57', '70.2', '-', '\xa0', '\xa0', '\xa0', '\xa0']
['14', '9.3', '11', '7.8', '1003.8', '76', '0', '18.3', '58.2', '64.8', '-', '\xa0', '\xa0', '\xa0', '\xa0']

If you check the website, the rows are complete. Any help is appreciated.

Best regards, Sir Ernest Shackleton

kerasbaz · Accepted Answer

They are using tags with attached styles. The attached styles use the CSS content property to build the values in cells that appear empty to bs4.

The data is all there in the HTML, but you will need to process the styles to get it.

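A quick and dirty fix would be to assume the styles don't change and write a pre-processing replacement for each styled tag, something like:

html = html.replace('<span class="…"></span>', '1')
html = html.replace('<span class="…"></span>', '5')

where each first argument is the empty styled tag exactly as it appears in the page source.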

A better solution would be to process the styles with either a CSS engine or a regex, build the mapping each time you load the page, and then apply it to substitute the text.
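
The regex variant could be sketched roughly as follows. This is only a minimal sketch under assumptions: it supposes the stylesheet is embedded in the page and uses rules of the form .someclass::after{content:"1"}, and that the hidden digits sit in empty span tags carrying that class. The class names (nta, ntb, ntc), the sample markup, and the helper names build_content_map and fill_styled_spans are all made up for illustration, so check the real page source and adjust the patterns.

```python
import re

# ASSUMPTION: both patterns describe a guessed rule/tag shape, not the
# site's confirmed markup. Adjust them after inspecting the page source.
RULE_RE = re.compile(
    r'\.([\w-]+)::?(?:after|before)\s*\{\s*content\s*:\s*["\']([^"\']*)["\']'
)
SPAN_RE = re.compile(r'<span class="([\w-]+)">\s*</span>')


def build_content_map(html):
    """Map each CSS class name to the text its content rule injects."""
    return {cls: text for cls, text in RULE_RE.findall(html)}


def fill_styled_spans(html):
    """Replace empty styled spans with the text their CSS rule would render."""
    mapping = build_content_map(html)
    # Spans whose class has no known rule are left untouched, not guessed.
    return SPAN_RE.sub(lambda m: mapping.get(m.group(1), m.group(0)), html)


# Hypothetical sample mimicking the assumed structure (not copied from the site).
sample = (
    '<style>.numspan span.nta::after{content:"1"}'
    '.numspan span.ntb::after{content:"."}'
    '.numspan span.ntc::after{content:"5"}</style>'
    '<table class="numspan"><tr><td>4</td>'
    '<td><span class="nta"></span><span class="ntb"></span>'
    '<span class="ntc"></span></td></tr></table>'
)

cleaned = fill_styled_spans(sample)
print(cleaned)
```

Because the mapping is rebuilt from whatever rules are found on each load, this should keep working if the site rotates its class names, as long as the rule shape stays the same. The cleaned string can then be parsed as before, for example by passing it to BeautifulSoup or by wrapping it in io.StringIO for pandas.read_html.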
