Reputation: 91

Python: why site is not parsing?

I run this code against the website juventus.com, and I can parse the title:

from urllib import urlopen
import re

webpage = urlopen('http://juventus.com').read()
patFinderTitle = re.compile('<title>(.*)</title>')
findPatTitle = re.findall(patFinderTitle, webpage)
print findPatTitle

The output is:

['Welcome - Juventus.com']

But if I try the same code on another website, it returns nothing:

from urllib import urlopen
import re

webpage = urlopen('http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq').read()
patFinderTitle = re.compile('<title>(.*)</title>')
findPatTitle = re.findall(patFinderTitle, webpage)
print findPatTitle

Does anyone know why?

Upvotes: 0

Answers (3)

falsetru

Reputation: 369074

The content of http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq is (reformatted to make it easier to read):

<script type='text/javascript'>
top.location.href = 'https://www.facebook.com/dialog/oauth?
client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&
state=07c9ba739d9340de596f64ae21754376&scope=email&0=publish_actions';
</script>

There is no title tag, so the regular expression has nothing to match.
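The same diagnosis can be scripted. Below is a minimal sketch, written for Python 3 rather than the Python 2 used in the question, that checks an already-fetched page for a title and, failing that, for a JavaScript assignment to location.href. The helper names (extract_title, find_js_redirect, diagnose) and the abridged REDIRECT_PAGE sample are made up for illustration, and the redirect pattern only covers the simple quoted-string assignment seen on this page:

```python
import re

# Non-greedy, case-insensitive, and DOTALL so a title spanning lines still matches.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# Matches assignments such as: top.location.href = '...' or location = "..."
JS_REDIRECT_RE = re.compile(
    r"""location(?:\.href)?\s*=\s*(['"])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

# Abridged copy of the page body shown above (hypothetical, shortened URL).
REDIRECT_PAGE = (
    "<script type='text/javascript'>"
    "top.location.href = 'https://www.facebook.com/dialog/oauth?"
    "client_id=466261910087459&scope=email';"
    "</script>"
)


def extract_title(html):
    """Return the stripped <title> text, or None if there is no title tag."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def find_js_redirect(html):
    """Return the URL assigned to location or location.href, or None."""
    match = JS_REDIRECT_RE.search(html)
    return match.group(2) if match else None


def diagnose(html):
    """Describe why a title lookup on this page succeeds or fails."""
    title = extract_title(html)
    if title is not None:
        return "title: " + title
    target = find_js_redirect(html)
    if target is not None:
        return "no title; JavaScript redirect to " + target
    return "no title and no redirect found"


print(diagnose(REDIRECT_PAGE))
```

This only reports the redirect target; it does not follow it or run any script.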

Use Selenium to evaluate the JavaScript:

from selenium import webdriver

driver = webdriver.Firefox() # webdriver.PhantomJS()
driver.get('http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq')
print driver.title
driver.quit()

Upvotes: 4

Vasile-Bogdan Raica

Reputation: 71

That's because the page returned by urlopen contains only a JavaScript redirect; it doesn't contain a title tag.

This is what it contains:

<script type='text/javascript'>top.location.href = 'https://www.facebook.com/dialog/oauth?client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&state=0f9abed6de7412b5129a4d105a4be25f&scope=email&0=publish_actions';</script>

Also, I may be wrong, but as far as I recall you can't use urlopen to run JavaScript code. You will need a different Python module. I can't remember its name right now, but I believe there is one that can run the JavaScript; it needs a GUI and a real browser to drive, e.g. Firefox.

Upvotes: 0

Marcin

Reputation: 49826

Because the request is redirected, and the regex does not match the title tag on the page it is redirected to.

Your code should (a) use BeautifulSoup to parse the HTML instead of regexes, or, if you know the output will be well-formed XML, lxml (or lxml with the BeautifulSoup backend); and (b) use requests, a simpler module for making HTTP requests, which can handle redirects transparently.
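As a rough illustration of point (a), here is a sketch of parser-based title extraction. It assumes Python 3 and uses the standard library's html.parser as a stand-in for BeautifulSoup, so it runs without installing anything; TitleParser and get_title are hypothetical names. Fetching is left out, and the function takes HTML that has already been downloaded:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # currently between <title> and </title>
        self._done = False       # first title element fully read
        self._parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes on the tag are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        """Stripped title text, or None if no complete title was found."""
        if not self._done:
            return None
        return "".join(self._parts).strip()


def get_title(html):
    """Parse an HTML string and return its title, or None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Unlike the pattern `<title>(.*)</title>`, this copes with attributes on the tag and with a title that spans several lines. Like any parser, it only sees the HTML that was actually returned, so a page with no title element gives None.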

Upvotes: 0
