Bruce

Reputation: 35285

How can I fetch the page source of a webpage using Python?

I wish to fetch the source of a webpage and parse individual tags myself. How can I do this in Python?

Upvotes: 0

Views: 256

Answers (4)

Jonathan Livni

Reputation: 107272

All the answers here are correct, and BeautifulSoup is great. However, when the source HTML is generated dynamically by JavaScript, as is usually the case these days, you need an engine that first builds the final HTML and only then fetch it; otherwise most of the content will be missing.

As far as I know, the easiest way is simply to use a browser's engine for this. In my experience, Python + Selenium + Firefox is the path of least resistance.
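
A minimal sketch of the Selenium + Firefox approach, assuming Python 3, the third-party selenium package, and a local Firefox installation. The function name fetch_rendered_source and the headless flag are illustrative choices, not part of the original answer.

```python
def fetch_rendered_source(url):
    """Return the page HTML after the browser has loaded and run its scripts."""
    # Imported inside the function so that this module can be loaded
    # even on machines where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # assumption: no visible browser window is wanted
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM as the browser currently sees it.
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading fails.
        driver.quit()
```

Content that the page loads asynchronously after the initial load may still be missing at this point; in that case an explicit wait before reading page_source is probably needed.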

Upvotes: 1

David Alber

Reputation: 18111

Some options are:

All except httplib2 and Beautiful Soup are in the Python Standard Library. The pages for each of the packages above contain simple examples that will let you see what suits your needs best.
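
As a rough illustration of the standard-library route for parsing individual tags, here is a sketch that uses html.parser from Python 3. The choice of html.parser and the names TagCollector and collect_tags are assumptions made for this example; they do not come from the answer above.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


def collect_tags(html):
    """Parse an HTML string and return its opening tags in document order."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags
```

For example, collect_tags('<p class="a">Hi <b>there</b></p>') yields one entry for the p tag with its class attribute and one for the b tag with no attributes.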

Upvotes: 2

jgritty

Reputation: 11935

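import urllib2  # Python 2 only; on Python 3 use urllib.request instead

# Fetch the page and keep its raw source as a string
html = urllib2.urlopen('http://stackoverflow.com').read()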

That's the simple answer, but you should really look at BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/
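
For readers on Python 3, where urllib2 no longer exists, the same fetch might look roughly like this. The function name fetch_source, the timeout value, and the UTF-8 fallback are assumptions made for this sketch.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=10.0):
    """Download a URL and return its body decoded to text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header when present;
        # falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_source('http://stackoverflow.com') would return the page source as a str rather than bytes.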

Upvotes: 3

Srikar Appalaraju

Reputation: 73688

I would suggest you use BeautifulSoup:

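# Fetch a page and parse its HTML (Python 2, BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup
import urllib2

doc = urllib2.urlopen('http://google.com').read()

# read() already returns a single string, so no join is needed
soup = BeautifulSoup(doc)

# Inspect the first top-level node of the parsed document
soup.contents[0].name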

After this you can parse pretty much anything out of the document. See the documentation, which has detailed examples of how to do it.
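
As one example of pulling something out of the parsed document, here is a sketch that collects link targets. It assumes Python 3 and the newer bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used above; the function name extract_links is hypothetical.

```python
def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    # Imported inside the function so that this module can be loaded
    # even on machines where bs4 is not installed.
    from bs4 import BeautifulSoup

    # "html.parser" selects Python's built-in parser, so no extra
    # parser library is required.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

The html argument can be the string returned by whichever fetch method you prefer.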

Upvotes: 1
