Reputation: 91
I run this code against the website juventus.com and can parse the title:
from urllib import urlopen
import re
webpage = urlopen('http://juventus.com').read()
patFinderTitle = re.compile('<title>(.*)</title>')
findPatTitle = re.findall(patFinderTitle, webpage)
print findPatTitle
The output is:
['Welcome - Juventus.com']
But if I try the same code on another website, it returns nothing:
from urllib import urlopen
import re
webpage = urlopen('http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq').read()
patFinderTitle = re.compile('<title>(.*)</title>')
findPatTitle = re.findall(patFinderTitle, webpage)
print findPatTitle
Does anyone know why?
Upvotes: 0
Views: 258
Reputation: 369074
The content of http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq
is (reformatted to make it easier to read):
<script type='text/javascript'>
top.location.href = 'https://www.facebook.com/dialog/oauth?
client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&
state=07c9ba739d9340de596f64ae21754376&scope=email&0=publish_actions';
</script>
There's no title tag, so the regular expression has nothing to match.
Use Selenium to evaluate the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox() # webdriver.PhantomJS()
driver.get('http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq')
print driver.title
driver.quit()
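To see the mismatch without starting a browser, here is a minimal sketch (Python 3, unlike the Python 2 code in the question) that runs the question's pattern against two sample bodies. The `redirect_page` and `titled_page` strings are stand-ins written for illustration, not live responses; the first is a shortened version of the script shown above.

```python
import re

# Same pattern as in the question.
TITLE_PATTERN = re.compile(r'<title>(.*)</title>')

# Stand-in for what the second URL returns: a script and no <title>.
redirect_page = (
    "<script type='text/javascript'>"
    "top.location.href = "
    "'https://www.facebook.com/dialog/oauth?client_id=466261910087459';"
    "</script>"
)

# Stand-in for a page that does carry a title, like the first URL.
titled_page = "<html><head><title>Welcome - Juventus.com</title></head></html>"

print(TITLE_PATTERN.findall(redirect_page))  # []
print(TITLE_PATTERN.findall(titled_page))    # ['Welcome - Juventus.com']
```

The empty list is the "nothing" from the question: the fetch worked, but the body has no title to find.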
Upvotes: 4
Reputation: 71
That's because the page that urlopen fetches contains only a JavaScript redirect; it doesn't contain a title tag.
This is what it contains:
<script type='text/javascript'>top.location.href = 'https://www.facebook.com/dialog/oauth?client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&state=0f9abed6de7412b5129a4d105a4be25f&scope=email&0=publish_actions';</script>
Also, I may be wrong, but as far as I recall you can't use urlopen to run JavaScript code. You will need a different Python module. I can't remember its name right now, but I recall there is one that can run the JavaScript, although it needs a GUI and a real browser to drive, e.g. Firefox.
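If you only need to know where the script would send a browser, you can sometimes pull the target out of the script text yourself. This is a rough sketch in Python 3, assuming the page assigns a single-quoted string to top.location.href exactly as in the snippet above. `find_js_redirect` is a name made up for this example, and nothing here executes any JavaScript.

```python
import re

# Matches: top.location.href = '...' (single-quoted string literal only).
REDIRECT_PATTERN = re.compile(r"top\.location\.href\s*=\s*'([^']*)'")


def find_js_redirect(html):
    """Return the URL assigned to top.location.href, or None if absent."""
    match = REDIRECT_PATTERN.search(html)
    return match.group(1) if match else None


# The script body quoted above, split across lines for readability.
page = (
    "<script type='text/javascript'>"
    "top.location.href = 'https://www.facebook.com/dialog/oauth?"
    "client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&"
    "state=0f9abed6de7412b5129a4d105a4be25f&scope=email&0=publish_actions';"
    "</script>"
)

print(find_js_redirect(page))
```

Note that the target here looks like a Facebook login dialog, so fetching it would probably not give you the title of the original page either.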
Upvotes: 0
Reputation: 49826
Because the request is redirected, and the regex does not match the title tag on the page it is redirected to.
Your code should (a) use BeautifulSoup, or, if you know the output will be well-formed XML, lxml (or lxml with the BeautifulSoup backend), to parse HTML instead of regexes, and (b) use requests, a simpler module for making HTTP requests, which follows HTTP redirects transparently.
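To illustrate point (a), here is a small sketch in Python 3 that uses the standard library's html.parser in place of BeautifulSoup, only so that it runs without installing anything; BeautifulSoup would be shorter. `TitleParser`, `extract_title` and the `messy_page` sample are made up for this example. It shows a title that the question's regex misses (uppercase tag, line breaks) but a parser still finds.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as 'title'.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        if not self._done:
            return None
        return ''.join(self._chunks).strip()


def extract_title(html):
    """Return the page title, or None if there is no <title> element."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


# Hypothetical page: valid HTML, but awkward for the question's regex.
messy_page = "<HTML><HEAD><TITLE>\n  Welcome - Juventus.com\n</TITLE></HEAD></HTML>"

print(re.findall('<title>(.*)</title>', messy_page))  # []
print(extract_title(messy_page))                       # Welcome - Juventus.com
```

A parser does not help when the body has no title at all, as with the script-only page in the question; in that case `extract_title` returns None.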
Upvotes: 0