Gustavo Amarante
Gustavo Amarante

Reputation: 242

Read complex html table with python

I am trying to bring this table into a pandas DataFrame. I've tried using pandas read_html, I've tried using requests and bs4. I would like to grab the whole table as we see it, but in the html code the table is broken into 3 blocks. Still, I have not figured out how to pin down each one of them.

Here is a starting code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-enUS.asp'

response = requests.get(url, params={'Data': '08/01/2018', 'Mercadoria': 'DI1'})

soup = BeautifulSoup(response.text, 'html.parser')

Upvotes: 1

Views: 513

Answers (1)

alecxe
alecxe

Reputation: 473843

One important thing that may help you solve it is the fact that the response you get with requests here does not really contain the rendered table element, but it definitely contains the desired data.

The problem is, the page requires javascript to have the table rendered. You may notice that your data is inside a script element:

<script type="text/javascript">
 var MercFut0 = "";
    var MercFut1 = "";
    var MercFut2 = "";
    var MercFut3 = "";

    MercFut0 = MercFut0 + '<table class="secondary">';
    MercFut0 = MercFut0 + '<tr><td></td></tr>';
    MercFut1 = MercFut1 + '<table class="secondary" id="teste">';
    MercFut1 = MercFut1 + '<tr style="height: 120px;">';
    MercFut2 = MercFut2 + '<table class="secondary" id="teste">';
    MercFut2 = MercFut2 + '<tr style="height: 120px;">';

    MercFut2 = MercFut2 + '<th class="text-center">Open Interest opening*</th>';
    MercFut2 = MercFut2 + '<th class="text-center">Open Interest closing**</th>';
    ...
    MercFut1 = MercFut1 + '</tr>';

    MercFut0 = MercFut0 + '</table>';
    MercFut1 = MercFut1 + '</table>';
    MercFut2 = MercFut2 + '</table>';
    MercFut3 = MercFut3 + '</table>';
    MercadoFut0.innerHTML = MercFut3;
    MercadoFut1.innerHTML = MercFut0;
    MercadoFut2.innerHTML = MercFut1;
    tableShow(2,false);
    tableShow(9,false);
</script>

At this point, the easiest approach would probably be to use smth like selenium to simply has this page rendered.

Or, you may try getting this script and executing it with, for example, pyexecjs.

Something along these lines:

import execjs

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = 'http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-sistema-pregao-enUS.asp'

response = requests.get(url, params={'Data': '08/01/2018', 'Mercadoria': 'DI1'})

soup = BeautifulSoup(response.text, 'html.parser')

# compile the desired table html from the data
script = soup.find("script", text=lambda text: text and 'tableShow' in text and "<table" in text).get_text()
script = """
var MercadoFut0 = {}, 
    MercadoFut1 = {}, 
    MercadoFut2 = {};
var tableShow = function () {};

function getTables() {
    %s
    return [MercFut1, MercFut2, MercFut3];
}
""" % script

ctx = execjs.compile(script)
table1, table2, table3 = ctx.call("getTables")

# parse tables into dataframes
df1 = pd.read_html(table1)[0]
df2 = pd.read_html(table2)[0]
df3 = pd.read_html(table3)[0]

print(df1)
print(df2)
print(df3)

Then, you can "concat" df1 and df2 which should get you the desired table:

df = pd.concat([df2, df1], axis=1)

pd.set_option('display.expand_frame_repr', False)
print(df)

Prints:

                        0                        1                 2               3                      4                       0                       1              2              3              4              5           6                 7        8         9           10
0   Open Interest opening*  Open Interest closing**  Number of Trades  Trading Volume  Financial Volume (R$)  Previous Settlement***  Indexed Settlement****  Opening Price  Minimum Price  Maximum Price  Average Price  Last Price  Settlement Price   Change  Last Bid  Last Offer
1                   442761                   442761                 0               0                      0                99999.99                       -          0.000          0.000          0.000          0.000       0.000         100000.00    0.01+     0.000       0.000
2                   760332                   792464               147          114160            11351487370                99434.83                99434.83          6.404          6.403          6.410          6.406       6.407          99434.72    0.11-     6.404       6.408
3                  2343218                  2377609               183           99890             9885888562                98967.40                98967.40          6.429          6.421          6.435          6.423       6.425          98967.53    0.13+     0.000       6.425
...
38                   72923                    74133               280            3920              126375887                32296.91                32296.91         11.510         11.450         11.600         11.521      11.580          32063.31  233.60-    11.590      11.610
39                    5325                     5325                 0               0                      0                28878.71                28878.71          0.000          0.000          0.000          0.000       0.000          28649.55  229.16-     0.000      11.680

Upvotes: 1

Related Questions