iceagle

Reputation: 21

pandas read_html does not wait for the page to finish loading

I am trying to read a table from a URL using pandas read_html, but the table I am interested in loads after the rest of the page, so the DataFrame I get looks like this instead of the actual content:

ColumnA     |     ColumnB

Still loading |    Still loading

Is there a way to tell read_html to wait until the table has fully loaded before reading it?

Upvotes: 1

Views: 1178

Answers (1)

j6m8

Reputation: 2409

It's hard to answer for sure without a specific code example, but be aware that read_html parses the static HTML exactly as the server returns it. It does not execute JavaScript at all, so it cannot wait for a script to fill in the table, which is likely what you are seeing in the browser when the table "loads".
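To make that concrete, here is a small sketch that needs no network access. The HTML string is made up for illustration (it is not the asker's page): it holds a placeholder table plus a script that a browser would run to fill in the real value. It assumes pandas and an HTML parser such as lxml are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: the table ships with placeholder cells, and a script
# would replace them once it runs in a browser.
STATIC_HTML = """
<html><body>
<table id="data">
  <thead><tr><th>ColumnA</th><th>ColumnB</th></tr></thead>
  <tbody><tr><td>Still loading</td><td>Still loading</td></tr></tbody>
</table>
<script>
  document.querySelector("#data td").textContent = "1";
</script>
</body></html>
"""

# read_html only parses the markup it is given; the <script> is never executed.
tables = pd.read_html(StringIO(STATIC_HTML))
print(tables[0])
```

The printed frame should still contain the "Still loading" placeholders, because nothing ever ran the script.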

You can also read more about common HTML-scraping gotchas with pandas here, though these are more relevant to performance than to waiting for a secondary page update.

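If you need to incorporate JavaScript updates into your crawl, you may need a browser automation tool such as Selenium [docs], which can drive a real (optionally headless) browser, or headless Chrome [related question]. Once the browser has rendered the page, you can pass the resulting HTML to read_html.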
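A minimal sketch of the Selenium route, under a few assumptions: Selenium 4 and a local Chrome are installed, and the table counts as "ready" once the "Still loading" placeholder from the question is gone. The helper names (wait_for_html, placeholder_gone, read_rendered_tables) are made up for this example and are not part of pandas or Selenium.

```python
import time
from io import StringIO


def wait_for_html(get_html, is_ready, timeout=30.0, poll_interval=0.5):
    """Call get_html() repeatedly until is_ready(html) is true, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("table did not finish loading in time")
        time.sleep(poll_interval)


def placeholder_gone(html):
    # Assumption: the page is ready once a table exists and the placeholder
    # text from the question no longer appears anywhere in the markup.
    return "<table" in html and "Still loading" not in html


def read_rendered_tables(url, timeout=30.0):
    """Render url in headless Chrome, wait for the table, then parse it."""
    # Imported here so the helpers above stay usable without these packages.
    import pandas as pd
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM after JavaScript has modified it.
        html = wait_for_html(lambda: driver.page_source, placeholder_gone, timeout)
    finally:
        driver.quit()
    return pd.read_html(StringIO(html))
```

You would call it as read_rendered_tables("https://example.com/page") with your own URL (the one here is a stand-in) and pick the frame you need from the returned list. Selenium also ships its own WebDriverWait helper for this kind of polling; the hand-rolled loop is used here only to keep the waiting logic visible.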

Upvotes: 1
