Reputation: 3813
Does anyone have experience with a query language for the web?
I am looking for project, commercial or not, that does a good job at making a webpage queryable and that even follows links on it to aggregate information from a bunch of pages.
I would prefere a sql or linq like syntax. I could of course download a webpage and start doing some XPATH on it but Im looking for a solution that has a nice abstraction.
I found websql
http://www.cs.utoronto.ca/~websql/
Which looks good but I'm not into Java
SELECT a.label
FROM Anchor a SUCH THAT base = "http://www.SomeDoc.html"
WHERE a.href CONTAINS ".ps.Z";
Are there others out there?
Is there a library that can be used in a .NET language?
Upvotes: 4
Views: 268
Reputation: 46643
Beautiful Soup and hpricot are the canonical versions, for Python and Ruby respectively.
For C#, I have used and appreciated HTML Agility Pack. It does an excellent job of turning messy, invalid HTML in queryable goodness.
There is also this C# html parser which looks good but I've not tried it.
Upvotes: 2
Reputation: 23792
See hpricot (a Ruby library).
# load the RedHanded home page
doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
# change the CSS class on links
(doc/"span.entryPermalink").set("class", "newLinks")
# remove the sidebar
(doc/"#sidebar").remove
# print the altered HTML
puts doc
It supports querying with CSS or XPath selectors.
Upvotes: 3
Reputation: 31133
You are probably looking for SPARQL. It doesn't let you parse pages, but it's designed to solve the same problems (i.e. getting data out of a site -- from the cloud). It's a W3C standard, but Microsoft, apparently, does not support it yet, unfortunately.
Upvotes: 1
Reputation: 992965
I'm not sure whether this is exactly what you're looking for, but Freebase is an open database of information with a programmatic query interface.
Upvotes: 0