Reputation: 19194
I am using Python and BeautifulSoup for HTML parsing, with the following code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"
main_url = urllib2.urlopen(url)
content = main_url.read()
soup = BeautifulSoup(content)
for a in soup.findAll('a',href=True):
    print a[href]
But I am not getting output links like http://www.wikipathways.org/index.php/Pathway:WP26
Another important point: there are 107 pathways, but I will not get all of the links, because the remaining links depend on "show links" at the bottom of the page.
So how can I get all 107 links from that URL?
Upvotes: 1
Views: 563
Reputation: 60014
Your problem is the line content = url.read(). url is only the URL string, not the opened page, so that call does not read the webpage at all; if anything, you should be getting an error, because a string has no read method. main_url is the object you want to read, so change that line to:
content = main_url.read()
You also have another error: print a[href]. Here href should be a string, so it should be:
print a['href']
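To see the string-key fix in isolation, here is a minimal offline sketch. It is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup, so it runs without extra installs; the LinkCollector class and the sample HTML are made up for illustration and are not part of your code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            # The attribute name is a string key, just like a['href'] in BeautifulSoup
            if "href" in attrs:
                self.links.append(attrs["href"])


# Made-up sample standing in for the downloaded page content
sample = (
    '<a href="/index.php/Pathway:WP26">WP26</a>'
    '<a name="top">no href here</a>'
    '<a href="/index.php/Help">Help</a>'
)

collector = LinkCollector()
collector.feed(sample)
links = collector.links
print(links)
```

The anchor without an href is skipped, which is the same effect as passing href=True to findAll.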
Upvotes: 2
Reputation: 21038
I would suggest using lxml. It's faster and better for parsing HTML, and worth investing the time to learn.
from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
# cssselect() needs the separate cssselect package; each link's target is a.get('href')
links = dom.cssselect('a')
That should get you going.
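Once you have the href values, narrowing them down to pathway pages could look roughly like the sketch below (Python 3, standard library only). The "Pathway:WP" substring is an assumption taken from the example URL in the question, and pathway_links, BASE, and the sample list are hypothetical names and data for illustration.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the URL in the question
BASE = "http://www.wikipathways.org/"


def pathway_links(hrefs, base=BASE):
    """Return absolute URLs for hrefs that look like pathway pages, without duplicates."""
    found = []
    for href in hrefs:
        # Skip missing hrefs and anything that is not a pathway page
        if href and "Pathway:WP" in href:
            full = urljoin(base, href)
            if full not in found:
                found.append(full)
    return found


# Made-up hrefs standing in for [a.get('href') for a in links]
hrefs = [
    "/index.php/Pathway:WP26",
    "/index.php/Help:Contents",
    "/index.php/Pathway:WP26",
    None,
]
result = pathway_links(hrefs)
print(result)
```

This only filters links that are already present in the HTML you downloaded; it does not by itself fetch the results hidden behind "show links".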
Upvotes: 1