sam

Reputation: 19194

Get all links from HTML, including those behind a "show more" link

I am using Python and BeautifulSoup for HTML parsing.

I am using the following code:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"

main_url = urllib2.urlopen(url)
content = main_url.read()
soup = BeautifulSoup(content)

for a in soup.findAll('a',href=True):
    print a[href]

But I am not getting output links like: http://www.wikipathways.org/index.php/Pathway:WP26

Also, importantly, there are 107 pathways, but I will not get all of the links, because the other links depend on the "show links" link at the bottom of the page.

So, how can I get all the links (107 links) from that URL?

Upvotes: 1

Views: 563

Answers (2)

TerryA

Reputation: 60014

Your problem is line 8, content = url.read(). url is a plain string, and strings have no read() method, so this line never reads the webpage (it should raise an AttributeError).

main_url is what you want to read, so change line 8 to:

content = main_url.read()

You also have another error, print a[href]. Written without quotes, href is an undefined variable name; the attribute key must be a string, so it should be:

print a['href']
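Putting the two fixes together, the link-collecting loop can be sketched as below. This is only an illustration under a few assumptions: it targets Python 3 and uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages, and the names LinkCollector, extract_links and sample_html are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; note the key is the string "href"
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample input standing in for the downloaded page content
sample_html = '<a href="/index.php/Pathway:WP26">WP26</a><a name="top">no href</a>'
for link in extract_links(sample_html):
    print(link)
```

For the real page, the text passed to extract_links would be the decoded result of reading the opened URL (main_url in the question), not the url string itself.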

Upvotes: 2

Mahdi Yusuf

Reputation: 21038

I would suggest using lxml. It's faster and better for parsing HTML, and it's worth investing the time to learn it.

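from lxml.html import parse

# parse() accepts a URL directly; getroot() returns the root <html> element
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
# All <a> elements (cssselect() needs the separate cssselect package installed)
links = dom.cssselect('a')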

That should get you going.
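Whichever parser collects the href values, they may come back as relative paths rather than full URLs like the one in the question. A rough sketch of resolving and filtering them follows, assuming Python 3, assuming the pathway links contain "/Pathway:" as in the question's example URL, and using a hypothetical helper named pathway_links. It does not handle the links that only appear after "show links".

```python
from urllib.parse import urljoin


def pathway_links(hrefs, base_url):
    """Resolve hrefs against base_url and keep unique pathway URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones
        absolute = urljoin(base_url, href)
        # Assumption: pathway pages are identified by "/Pathway:" in the URL
        if "/Pathway:" in absolute and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical href values standing in for what the parser returned
hrefs = ["/index.php/Pathway:WP26", "/index.php/Help:Contents"]
for url in pathway_links(hrefs, "http://www.wikipathways.org/"):
    print(url)
```

With lxml, the input list could be built from the elements above with something like [a.get('href') for a in links if a.get('href')].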

Upvotes: 1
