I am trying to read a page with urllib2 in order to extract data from it. Part of the page is generated on each load, and that part is missing from the HTML I get back when I read the URL with urllib2.
The URL is http://nametrends.net/name.php?name=Ruby and I am trying to get the table that is generated for the graph. For example:
<div aria-label="A tabular representation of the data in the chart." style="position: absolute; left: -10000px; top: auto; width: 1px; height: 1px; overflow: hidden;">
<table>
<tbody>
<tr><td>Sat Feb 01 1947 00:00:00 GMT-0500 (EST)</td><td>0.048</td><td>0</td></tr>
</tbody>
</table>
</div>
My current code is:
import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request('http://nametrends.net/name.php?name=Ruby')
response = urllib2.urlopen(req)
the_page = response.read()
html = BeautifulSoup(the_page)
print "tabular" in html
for table in html.find_all('table'):
    print 1
It does not find that table, and there is no div in the HTML containing the text "tabular" (which appears in the aria-label of the div that wraps the table).
Upvotes: 2
Views: 1194
I would start with:
bs = BeautifulSoup(the_page)
html = bs.html
Your code doesn't look too bad. Running
print str(BeautifulSoup(the_page))
will show what BeautifulSoup parsed the page into.
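One detail worth checking alongside this: `"tabular" in html` on a BeautifulSoup object tests the object's direct children, not the text of the document, so it can be False even when the word is present. A minimal sketch of a more direct check, written for Python 3 and using a hypothetical helper name `marker_in_source`, is to search the raw page source itself:

```python
def marker_in_source(page_source, marker="tabular"):
    """Return True if `marker` occurs anywhere in the raw page source.

    Accepts either bytes (as returned by response.read()) or str.
    Decoding as UTF-8 is an assumption about the page encoding.
    """
    if isinstance(page_source, bytes):
        page_source = page_source.decode("utf-8", errors="replace")
    return marker in page_source


# Static HTML without the generated table: the marker is absent.
static_html = b"<html><body><div id='chart'></div></body></html>"
print(marker_in_source(static_html))
```

If this returns False for the bytes read from the server, the table is not in the response at all and no parser setting will recover it.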
Upvotes: 0
If an alternative to urllib2 is acceptable, Selenium handles this kind of task easily because it drives a real browser, so the JavaScript that builds the table actually runs:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

url = 'http://nametrends.net/name.php?name=Ruby'
driver = webdriver.Firefox()
driver.get(url)
# wait until 'tabular' appears in the rendered page
# (the 10-second timeout is an arbitrary choice)
WebDriverWait(driver, 10).until(lambda d: 'tabular' in d.page_source)
html = BeautifulSoup(driver.page_source)
driver.quit()
for table in html.find_all('table'):
    print table
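Once the rendered source is available, the rows still have to be pulled out of the hidden table. The following is a minimal sketch in Python 3 that does this for the sample markup from the question. It uses the standard library's `html.parser` rather than BeautifulSoup only to stay dependency-free; the class name `TableRows` and the function `extract_rows` are hypothetical, and with BeautifulSoup the same rows could be collected from `find_all('tr')`.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = (
    '<div aria-label="A tabular representation of the data in the chart." '
    'style="position: absolute; left: -10000px; top: auto; width: 1px; '
    'height: 1px; overflow: hidden;">\n'
    "<table>\n<tbody>\n"
    "<tr><td>Sat Feb 01 1947 00:00:00 GMT-0500 (EST)</td>"
    "<td>0.048</td><td>0</td></tr>\n"
    "</tbody>\n</table>\n</div>"
)


class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


for row in extract_rows(SAMPLE):
    print(row)
```

Each row comes back as a list of strings (date, then the two numeric columns as text), so converting the values to numbers or dates is left to the caller.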
Upvotes: 2
The table is filled with data returned by an additional XHR request to the getfrequencyjson.php endpoint, which is why it never appears in the HTML that urllib2 downloads. You need to make that request yourself and parse the JSON it returns:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}

with requests.Session() as session:
    # send the browser-like User-Agent on every request in this session
    session.headers.update(headers)
    # visit the page first, as a browser would, before calling the data endpoint
    session.get('http://nametrends.net/name.php', params={'name': 'ruby'})
    response = session.get('http://nametrends.net/chartdata/getfrequencyjson.php', params={'name': 'ruby'})
    results = response.json()
    print results
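For readers without `requests`, the same XHR call can be sketched with the Python 3 standard library. This is an illustration under assumptions, not a tested client: the helper names `frequency_url` and `fetch_frequency` are hypothetical, the shortened User-Agent string and the 10-second timeout are arbitrary, and it skips the initial visit to name.php, which the endpoint may or may not require. The structure of the returned JSON is not shown in this thread, so the sketch returns it unmodified.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the answer above.
BASE = "http://nametrends.net/chartdata/getfrequencyjson.php"


def frequency_url(name):
    """Build the URL the page requests in the background for `name`."""
    return BASE + "?" + urlencode({"name": name})


def fetch_frequency(name, timeout=10):
    """Fetch and decode the JSON for `name` (performs a network request)."""
    req = Request(frequency_url(name), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(frequency_url("ruby"))
```

Calling `fetch_frequency("ruby")` would then take the place of the two `session.get` calls, assuming the server accepts the request without the prior page visit.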
Upvotes: 4