Roy Tang

Reputation: 5771

jQuery-like HTML parsing in Python?

Is there a way in Python to parse an HTML document the way jQuery does?

That is, I'd like to use CSS selector syntax to grab an arbitrary set of nodes from the document, read their content and attributes, and so on.

Upvotes: 70

Views: 36216

Answers (4)

systempuntoout

Reputation: 74154

If you are fluent with BeautifulSoup, you could just add soupselect to your libs.
Soupselect is a CSS selector extension for BeautifulSoup.

Usage:

from bs4 import BeautifulSoup as Soup
from soupselect import select
from urllib.request import urlopen
soup = Soup(urlopen('http://slashdot.org/'), 'html.parser')
select(soup, 'div.title h3')
    [<h3><span><a href='//science.slashdot.org/'>Science</a>:</span></h3>,
     <h3><a href='//slashdot.org/articles/07/02/28/0120220.shtml'>Star Trek</h3>,
    ..]

Upvotes: 65

imbr

Reputation: 7682

BeautifulSoup now supports CSS selectors natively:

import requests
from bs4 import BeautifulSoup as Soup
html = requests.get('https://stackoverflow.com/questions/3051295').content
soup = Soup(html, 'html.parser')

Title of this question

soup.select('h1.grid--cell :first-child')[0].text

Number of question upvotes

# first item 
soup.select_one('[itemprop="upvoteCount"]').text

This uses Python Requests to fetch the HTML page.
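Since the markup of a live page can change at any time, here is a minimal offline sketch of the same idea. The `SAMPLE` markup and the variable names are made up for illustration; only `select`, `select_one`, and `get_text` come from BeautifulSoup itself (installed with `pip install beautifulsoup4`).

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical markup, used only so the example runs without network access.
SAMPLE = """
<ul id="menu">
  <li class="item"><a href="/home">Home</a></li>
  <li class="item active"><a href="/docs">Docs</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# select() returns a list of every element matching the CSS selector.
hrefs = [a["href"] for a in soup.select("#menu li.item > a")]

# select_one() returns only the first match (or None if nothing matches).
active = soup.select_one("li.active a")
active_text = active.get_text()

print(hrefs)
print(active_text)
```

Attributes are read with dictionary-style access (`a["href"]`) and text with `get_text()`, which roughly covers the "read their content/attributes" part of the question.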

Upvotes: 10

Ignacio Vazquez-Abrams

Reputation: 799560

The lxml library supports CSS selectors.

Upvotes: 14

Luke Stanley

Reputation: 1294

Consider PyQuery:

http://packages.python.org/pyquery/

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> import urllib.request
>>> d = pq("<html></html>")
>>> d = pq(etree.fromstring("<html></html>"))
>>> d = pq(url='http://google.com/')
>>> d = pq(url='http://google.com/', opener=lambda url, **kw: urllib.request.urlopen(url).read())
>>> d = pq(filename=path_to_html_file)
>>> d("#hello")
[<p#hello.hello>]
>>> p = d("#hello")
>>> p.html()
'Hello world !'
>>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
[<p#hello.hello>]
>>> p.html()
'you know <a href="http://python.org/">Python</a> rocks'
>>> p.text()
'you know Python rocks'

Upvotes: 50
