erikcw

Reputation: 11057

Clojure equivalent to Python's lxml library?

I'm looking for the Clojure/Java equivalent to Python's lxml library.

I've used it extensively for parsing all sorts of HTML (as a replacement for BeautifulSoup), and it's great to be able to use the same ElementTree API for XML as well. It has really been a trusted friend! Can anyone recommend a similar Java/Clojure library?

About lxml

lxml is an XML and HTML processing library built on libxml2. It handles broken HTML pages very well, so it is excellent for screen-scraping tasks. It also implements the ElementTree API, so the XML/HTML structure is represented as a tree object with full support for XPath and CSS selectors, among other things.

It also has some really handy utility functions, such as the "cleaner" module, which strips unwanted tags (e.g. script and style tags) out of the "soup".

So it is simple to use, robust, and VERY fast!

Upvotes: 10

Views: 1399

Answers (2)

dnolen

Reputation: 18556

Enlive: http://github.com/cgrand/enlive

I've used it for screen-scraping and it works quite well for that. It uses a CSS-selector-like syntax for getting at elements in the document.

Upvotes: 8

pmf

Reputation: 7749

For Java (and thus usable from Clojure), there is the TagSoup library, which, like lxml, is a tolerant parser for malformed SGML variants such as real-world HTML.

Clojure has a bundled namespace, clojure.xml, but it will only work with well-formed XML.
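To illustrate why a strict XML parser is not enough for scraping, here is a minimal Java sketch using only the JDK's built-in SAX parser. As far as I know, clojure.xml delegates to this same JDK parser by default, but treat that as an assumption. The class name `StrictParseDemo` and the helper `isWellFormed` are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictParseDemo {

    // Hypothetical helper: returns true if the JDK's SAX parser accepts
    // the input as well-formed XML, false if it reports a parse error.
    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Thrown for malformed input such as unclosed or mismatched tags.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Every tag is closed and properly nested.
        System.out.println(isWellFormed("<p>closed <b>properly</b></p>"));   // prints true
        // Typical "tag soup": <b> and <br> are never closed.
        System.out.println(isWellFormed("<p>unclosed <b>tag soup<br></p>")); // prints false
    }
}
```

The second input is the kind of markup browsers render without complaint, which is why a tolerant parser such as TagSoup is the better fit for scraping.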

Upvotes: 5
