Reputation: 35285
I wish to fetch the source of a webpage and parse individual tags myself. How can I do this in Python?
Upvotes: 0
Views: 256
Reputation: 107272
All the answers here are correct, and BeautifulSoup is great. However, when the HTML is generated dynamically by JavaScript, which is usually the case these days, you need an engine that first builds the final HTML and only then fetch it; otherwise most of the content will be missing.
As far as I know, the easiest way is to use a browser's engine for this. In my experience, Python + Selenium + Firefox is the path of least resistance.
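A minimal sketch of that approach, assuming Python 3 with the third-party `selenium` package and a local Firefox install. The function name `fetch_rendered_html` is made up for illustration.

```python
def fetch_rendered_html(url):
    """Return the page's HTML as the browser sees it after loading the page."""
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # starts a real Firefox instance
    try:
        driver.get(url)            # returns once the page has loaded
        return driver.page_source  # the DOM serialized back to HTML
    finally:
        driver.quit()              # always close the browser


# Example use (needs Firefox and network access):
# html = fetch_rendered_html('http://stackoverflow.com')
```

Content that scripts add some time after the page has loaded may still be missing at this point; in that case an explicit wait before reading `page_source` is probably needed.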
Upvotes: 1
Reputation: 18111
Some options are:
All except httplib2 and Beautiful Soup are in the Python Standard Library. The pages for each of the packages above contain simple examples that will let you see what suits your needs best.
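As a rough sketch of the standard-library-only route, assuming Python 3 (where `urllib2` became `urllib.request` and the `HTMLParser` module became `html.parser`): the code below fetches a page and collects every start tag with its attributes. `TagCollector`, `collect_tags` and `fetch` are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag and its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


def collect_tags(html):
    """Parse an HTML string and return a list of (tag, attributes) tuples."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def fetch(url):
    """Download a page and decode it to text."""
    with urlopen(url) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


# Example use (needs network access):
# for tag, attrs in collect_tags(fetch('http://stackoverflow.com')):
#     print(tag, attrs)
```

This covers the "parse individual tags myself" part of the question without extra installs, though it is more work than Beautiful Soup for anything beyond simple tag walking.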
Upvotes: 2
Reputation: 11935
# Python 2; in Python 3 the same function lives in urllib.request
import urllib2
html = urllib2.urlopen('http://stackoverflow.com').read()
That's the simple answer, but you should really look at BeautifulSoup:
http://www.crummy.com/software/BeautifulSoup/
Upvotes: 3
Reputation: 73688
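I would suggest you use BeautifulSoup:
# For HTML parsing (BeautifulSoup 3 on Python 2)
from BeautifulSoup import BeautifulSoup
import urllib2
# Download the page; read() already returns a single string, so no join is needed
doc = urllib2.urlopen('http://google.com').read()
soup = BeautifulSoup(doc)
# Name of the first top-level node in the parsed document
soup.contents[0].name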
After this you can parse pretty much anything out of the document. See the documentation, which has detailed examples of how to do it.
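To illustrate what "parse anything" might look like, here is a small sketch that pulls every link target out of a page. It assumes Python 3 and the newer `bs4` package (Beautiful Soup 4) rather than the `BeautifulSoup` 3 import used above; `extract_links` is a made-up name.

```python
def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    # Third-party package, installed as beautifulsoup4; imported lazily so
    # the module still loads without it.
    from bs4 import BeautifulSoup

    # 'html.parser' selects Python's built-in parser, so no extra parser is needed.
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]


# Example use with a string fetched earlier:
# print(extract_links(doc))
```

The same `find_all` call accepts other tag names and attribute filters, so the pattern should carry over to most "give me all X" tasks.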
Upvotes: 1