user3116297

Reputation: 51

BeautifulSoup: Within Webpage

I have just installed BeautifulSoup. I can extract all the links using BS, but I can't use it to navigate WITHIN the webpage. Is there a way to give the main URL and extract all the information from the links in the webpage?

Upvotes: 0

Views: 40

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77505

I found lxml to be more efficient, more consistent to use, and even more robust than BeautifulSoup.

In a number of cases (maybe related to encodings?), BeautifulSoup failed badly at parsing broken web pages for me. The lxml result was close to what web browsers see, and it worked much better on these broken pages.

Extracting links is trivial with either:

BeautifulSoup:

for a in soup.findAll('a'):
    # Do something with a['href']

lxml:

for href in doc.xpath('//a/@href'):
    # Do something with href

alternate lxml:

for a in doc.xpath('//a'):
    # Do something with a.get('href')

Please see the documentation on how to parse the document.
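As a fuller sketch of the lxml variant, here is a self-contained example; the markup and URLs are made up for illustration, and a real script would parse a fetched page instead:

```python
from lxml import html

# A small, made-up page standing in for a fetched document.
page = """
<html><body>
  <a href="/about">About</a>
  <a href="http://example.com/contact">Contact</a>
</body></html>
"""

# lxml's HTML parser is tolerant of broken markup.
doc = html.fromstring(page)

# The XPath expression returns the href attribute values directly.
hrefs = doc.xpath('//a/@href')
print(hrefs)  # ['/about', 'http://example.com/contact']
```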

Upvotes: 0

alecxe

Reputation: 474281

You can still use BeautifulSoup for extracting links from a web page. For following them, you can either stick with urllib2 or use requests.
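A minimal sketch of that combination, assuming Python 3 with BeautifulSoup 4 installed (urllib.request is the Python 3 successor to urllib2; the page and URLs here are illustrative):

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

def extract_links(base_url, markup):
    """Return absolute URLs for every <a href> found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

# Demonstrate on a small in-memory page rather than a live fetch.
page = '<a href="/one">One</a> <a href="two.html">Two</a>'
links = extract_links("http://example.com/index.html", page)
print(links)  # ['http://example.com/one', 'http://example.com/two.html']

# Following the links would then be a plain fetch per URL, e.g.:
# for url in links:
#     body = urlopen(url).read()
```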

Another option that could fit your needs better is the Scrapy web-scraping framework. It has a built-in link-extraction mechanism:

LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects) which will be eventually followed.

Hope that helps.

Upvotes: 1
