Jourdans

Reputation: 53

Fetching "hidden" data from web page using python

I have written a Python program that uses tidal data from the French hydrographic office. Currently I open this site with Mozilla Firefox under Windows 10: http://maree.shom.fr/harbor/BREST/wl/0?date=2016-10-31&utc=standard (selecting "Hauteur d'eau heure par heure" and setting the harbor and date). Then I right-click, choose "Save as" from the pop-up menu with the text-file option selected, and get a file in which the relevant tables are present, e.g.:

Lundi 31 octobre 2016
00:00   01:00   02:00   03:00   04:00   05:00
1.79m   2.76m   4.09m   5.43m   6.45m   6.87m
06:00   07:00   08:00   09:00   10:00   11:00
6.56m   5.64m   4.42m   3.21m   2.22m   1.61m...

My Python app extracts the data from this file using regexes. I would like to automate this process (open the page from the Python app and fetch the relevant content), but I have not found a way to do it. The HTML source of the page (viewed by right-clicking in Firefox) does not contain the tidal tables. I tried using Selenium, but all I get is the same useless HTML. Is there any way to emulate what Firefox does when "Save as" with the text option is executed?
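For reference, the extraction step on the saved text file can be sketched like this (the sample string and the regex patterns here are illustrative, not the app's actual code):

```python
import re

# Invented sample mimicking the saved text file shown above
sample = """Lundi 31 octobre 2016
00:00   01:00   02:00
1.79m   2.76m   4.09m
03:00   04:00   05:00
5.43m   6.45m   6.87m"""

times, heights = [], []
for line in sample.splitlines():
    # Time lines contain hh:mm tokens; height lines contain e.g. "1.79m"
    times.extend(re.findall(r"\b(\d{2}:\d{2})\b", line))
    heights.extend(re.findall(r"(\d+\.\d+)m", line))

# Lines alternate times/heights, so the two lists stay aligned
tides = dict(zip(times, heights))
print(tides)  # {'00:00': '1.79', '01:00': '2.76', ...}
```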

Upvotes: 1

Views: 1013

Answers (1)

el3ien

Reputation: 5415

This does not emulate what Firefox does, but it will give you the data in a dictionary if you want.
The idea is to find the <tbody> tag in the HTML and then split out the rows. The times are in <th> tags and the heights in <td> tags, so a couple of loops plus zip does it.
In this example the HTML is in a file; you could also have it in a variable.

# Read the saved HTML (here from a file; a string works too)
with open("html.txt", "r") as fh:
    f = fh.read()

# Slice out the table body
table = f[f.find("<tbody>"):f.find("</tbody>")]

# Rows alternate: one <tr> of <th> times, then one <tr> of <td> heights
rows = table.split("<tr>")

data = []

for i in range(1, len(rows), 2):
    data.extend(zip(rows[i].split("<th>")[1:], rows[i + 1].split("<td>")[1:]))

# Trim each cell at the first closing tag, keeping just the text
for i in range(len(data)):
    x, y = data[i]
    data[i] = x[:x.find("<")], y[:y.find("<")]

print(dict(data))
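To see the approach end to end, the same slicing can be wrapped in a function and run on a minimal, invented table snippet (the HTML below only mimics the page's structure):

```python
def parse_table(html):
    # Slice out the <tbody>, then split into rows; times sit in <th>
    # cells and heights in the following row's <td> cells.
    table = html[html.find("<tbody>"):html.find("</tbody>")]
    rows = table.split("<tr>")
    data = []
    for i in range(1, len(rows), 2):
        data.extend(zip(rows[i].split("<th>")[1:], rows[i + 1].split("<td>")[1:]))
    # Trim each cell at the first closing tag
    return {t[:t.find("<")]: h[:h.find("<")] for t, h in data}

html = ("<tbody>"
        "<tr><th>00:00</th><th>01:00</th></tr>"
        "<tr><td>1.79m</td><td>2.76m</td></tr>"
        "</tbody>")
print(parse_table(html))  # {'00:00': '1.79m', '01:00': '2.76m'}
```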

Update:

The reason you don't see the table in the HTML is that it is generated by JavaScript, so we need something like Selenium, as you already tried.
I don't know whether the owners of that website want you to scrape it, so you could ask them, or check whether there is an API.
That said, this is how you can scrape JavaScript-generated content.
I installed PhantomJS for the webdriver.

from selenium import webdriver
import time

website_link = "http://maree.shom.fr/harbor/BREST/wl/0?date=2016-10-31&utc=standard"

driver = webdriver.PhantomJS(executable_path="/usr/bin/phantomjs")
driver.get(website_link)
time.sleep(10)  # wait as long as it takes for the data to be loaded
print(driver.find_element_by_tag_name("table").text)
driver.close()

Upvotes: 1
