Reputation: 51
I have just installed BeautifulSoup. I can extract all the links using BS, but I can't use it to navigate WITHIN the webpage. Is there a way to give the main URL and extract all the information from the links on the webpage?
Upvotes: 0
Views: 40
Reputation: 77505
I found lxml to be more efficient, more consistent to use, and even more robust than BeautifulSoup. In a number of cases (maybe related to encodings?) BeautifulSoup would fail badly at parsing some broken web pages for me. The lxml result was close to what web browsers see, and it worked much better on those broken pages.
Extracting links is trivial with either:
BeautifulSoup:
for a in soup.findAll('a'):
    # Do something with a['href']
lxml:
for href in doc.xpath('//a/@href'):
    # Do something with "href"
Alternate lxml:
for a in doc.xpath('//a'):
    # Do something with a.get('href')
Please see the lxml documentation on how to parse the document.
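For completeness, here is a minimal sketch of the whole flow with lxml: parsing a page straight from a URL and pulling out its links. The URL is just a placeholder, and the make_links_absolute call is optional but handy for turning relative hrefs into full URLs:

from lxml import html

# Parse the page directly from a URL (a file path or file object works too)
doc = html.parse('http://example.com').getroot()

# Optional: resolve relative links against the page's own URL
doc.make_links_absolute('http://example.com')

for href in doc.xpath('//a/@href'):
    print(href)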
Upvotes: 0
Reputation: 474281
You can still use BeautifulSoup for extracting links from a web page. For following them, you can either stick with urllib2 or use requests.
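For example, a minimal Python 3 sketch, assuming the requests and bs4 packages are installed and using example.com as a placeholder for your main URL:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

main_url = 'http://example.com'  # placeholder: your starting page
soup = BeautifulSoup(requests.get(main_url).text, 'html.parser')

for a in soup.find_all('a', href=True):
    link = urljoin(main_url, a['href'])  # resolve relative links
    page = requests.get(link)            # follow the link
    # Do something with page.text, e.g. parse it with BeautifulSoup too
    print(link, page.status_code)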
Another option that could better fit your needs is the Scrapy web-scraping framework. It has a link-extracting mechanism built in:
LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects) which will be eventually followed.
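Roughly, a spider using it could look like this. This is only a sketch: the spider name, start URL, and the fields yielded are placeholders, and the import path shown is the one used by current Scrapy versions:

import scrapy
from scrapy.linkextractors import LinkExtractor

class FollowLinksSpider(scrapy.Spider):
    name = 'follow_links'                 # placeholder spider name
    start_urls = ['http://example.com']   # placeholder start page

    def parse(self, response):
        # Extract every link on the page and schedule a request for each
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_page)

    def parse_page(self, response):
        # Do something with the linked page
        yield {'url': response.url, 'title': response.css('title::text').get()}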
Hope that helps.
Upvotes: 1