Quantico

Reputation: 2446

python urllib2 - reading a page after all scripts ran

I am trying to read a page with urllib2 in order to extract data from it. Part of the page is generated on each load, and that part is missing from the HTML that urllib2 returns.

The URL is http://nametrends.net/name.php?name=Ruby, and I am trying to get the table that is generated for the graph. For example:

<div aria-label="A tabular representation of the data in the chart." style="position: absolute; left: -10000px; top: auto; width: 1px; height: 1px; overflow: hidden;">
        <table>
            <tbody>
            <tr><td>Sat Feb 01 1947 00:00:00 GMT-0500 (EST)</td><td>0.048</td><td>0</td></tr>
            </tbody>
         </table>
</div>

My current code is:

import urllib2
from bs4 import BeautifulSoup
req = urllib2.Request('http://nametrends.net/name.php?name=Ruby')
response = urllib2.urlopen(req)
the_page = response.read()

html = BeautifulSoup(the_page)
print "tabular" in html
for table in html.find_all('table'):
    print 1

It does not find that table, and there is no div in the HTML containing the text "tabular" (which appears in the label of the div that contains the table).

Upvotes: 2

Views: 1194

Answers (3)

andrew pate

Reputation: 4299

To start, I would write:

bs = BeautifulSoup(the_page)
html = bs.html

Your code doesn't look too bad. Running...

print str(BeautifulSoup(the_page))

will show what Beautiful Soup parsed the page into.
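
It can also help to check the raw response text before parsing, to confirm the table is absent from what the server sent rather than lost by the parser. The following is a minimal sketch using only the standard library (Python 3 syntax). `TagCounter`, `count_tags` and the sample markup are hypothetical stand-ins, with `static_html` playing the role of `the_page`:

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(markup, tag):
    """Return the number of <tag> start tags found in markup."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts.get(tag, 0)


# Hypothetical stand-in for the_page: static HTML before any script has run.
static_html = (
    "<html><body><div id='chart'></div>"
    "<script src='chart.js'></script></body></html>"
)

# Substring test on the raw text, not on a parsed object.
print("tabular" in static_html)
# Number of <table> start tags in the raw markup.
print(count_tags(static_html, "table"))
```

If both checks come back empty on the real response, the table is most likely added by a script after the page loads, and no change to the parsing step will recover it.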

Upvotes: 0

Anzel

Reputation: 20563

If an alternative to urllib2 is an option, Selenium handles this kind of task easily by driving an actual browser:

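from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

url = 'http://nametrends.net/name.php?name=Ruby'
driver = webdriver.Firefox()
driver.get(url)
# wait up to 10 seconds until 'tabular' appears in the rendered page
WebDriverWait(driver, 10).until(lambda d: 'tabular' in d.page_source)

html = BeautifulSoup(driver.page_source)
driver.quit()  # close the browser once the rendered HTML has been captured
for table in html.find_all('table'):
    print table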

Upvotes: 2

alecxe

Reputation: 474003

The table is filled with the data returned by an additional XHR request to the getfrequencyjson.php endpoint. You need to make that request in your code and parse the JSON data:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}

with requests.Session() as session:
    # merge into the session defaults so every request sends the User-Agent
    session.headers.update(headers)
    session.get('http://nametrends.net/name.php', params={'name': 'ruby'})

    response = session.get('http://nametrends.net/chartdata/getfrequencyjson.php', params={'name': 'ruby'})
    results = response.json()
    print results
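
Once the JSON is in hand, turning it into table rows is a short step. The sketch below (Python 3, standard library only) builds the endpoint URL and converts a payload into tuples. The payload shape is an assumption: the `year` and `frequency` keys and the sample values are made up for illustration, so inspect the real response and adjust the keys before relying on it. `build_url` and `to_rows` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the XHR request described above.
BASE = "http://nametrends.net/chartdata/getfrequencyjson.php"


def build_url(name):
    """Return the endpoint URL with the name as a query parameter."""
    return BASE + "?" + urlencode({"name": name})


def to_rows(payload):
    """Convert a JSON array of records into (year, frequency) tuples.

    ASSUMPTION: each record has 'year' and 'frequency' keys. The real
    response may be shaped differently.
    """
    records = json.loads(payload)
    return [(record["year"], record["frequency"]) for record in records]


# Hypothetical sample payload with made-up values, not real site data.
sample = '[{"year": 1947, "frequency": 0.048}, {"year": 1948, "frequency": 0.051}]'

print(build_url("ruby"))
for row in to_rows(sample):
    print(row)
```

With requests, `response.json()` already returns the decoded structure, so only the list comprehension in `to_rows` would be needed there.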

Upvotes: 4
