Reputation: 3501
I am working on a Groovy script that will go through all my local HTML files and parse certain tags in them. I tried using something like HtmlCleaner, and it just is not working. I also tried reading the files line by line, but that only works when the content I need is on one line. I have this script up on GitHub, https://github.com/jrock2004/johns-octopress-scripts/blob/master/convertCompiledPosts/convertPosts.groovy. Thanks for any input.
Edit: So I am getting closer. I have this code now:
    def parser = new org.cyberneko.html.parsers.SAXParser()
    new XmlParser( parser ).parse( curFile+ "/index.html" ).with { page ->
        page.'**'.DIV.grep { it.'@class'?.contains 'entry-content' }.each {
            println it
            println "--------------------------------"
        }
    }
And what it prints is:
DIV[attributes={class=entry-content}; value=[P[attributes={}; value=[As an automation developer, I have learned how to write code in Java. When I am having an issue, one of the nice things that you can do is debug your code, line by line. For the longest I had wished that something like this existed in PHP. I have come to find out that you can actually debug code, like I do in Java. This is such a helpful task because I do not have to waste time using var_dump and such on variables or results. In your apache/php server you need to install and or enable something called, A[attributes={href=http://xdebug.org/}; value=[Xdebug]], . I will work on a tutorial on how to use xdebug while writing code in Sublime Text 2. So keep an eye out on my blog and or, A[attributes={href=http://www.youtube.com/jrock20041}; value=[YouTube]], channel for this tutorial.]]]]
So basically, what I want is all the text, including the HTML elements, inside the div with the class entry-content. If you want to see the page, it can be found here: http://jcwebconcepts.net/blog/2013/02/02/xdebug/
Thanks for your help
Upvotes: 1
Views: 9080
Reputation: 171114
It does work... Save the HTML for this page to a file, then you can parse it.
The following code prints the name of the author of every comment on the page:
    @Grab('net.sourceforge.nekohtml:nekohtml:1.9.16')
    def parser = new org.cyberneko.html.parsers.SAXParser()
    new XmlParser( parser ).parse( file ).with { page ->
        page.'**'.A.grep { it.'@class'?.contains 'comment-user' }.each {
            println it.text()
        }
    }
When file is set to be a File pointing to the saved HTML (or a String containing the URL of this question), it prints:
tim_yates
jrock2004
tim_yates
To print the contents of a given node, you could do (using the example from your edited question):
    @Grab('net.sourceforge.nekohtml:nekohtml:1.9.16')
    import groovy.xml.*
    def parser = new org.cyberneko.html.parsers.SAXParser()
    new XmlParser( parser ).parse( 'http://jcwebconcepts.net/blog/2013/02/02/xdebug/' ).with { page ->
        page.'**'.DIV.grep { it.'@class'?.contains 'entry-content' }.each { it ->
            println XmlUtil.serialize( it )
        }
    }
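If you want to try the difference between text() and XmlUtil.serialize() without hitting the network, here is a rough sketch. The markup is made up for illustration, and it goes through the plain XmlParser rather than NekoHTML, so it only works because the snippet is well-formed. The tag names are upper-case to mimic what NekoHTML produces by default.

```groovy
import groovy.xml.*

// Made-up markup standing in for a real page. Tag names are upper-case
// to mimic what NekoHTML produces by default.
def html = '''<HTML><BODY>
<DIV class="sidebar"><P>Not wanted</P></DIV>
<DIV class="entry-content"><P>Install <A href="http://xdebug.org/">Xdebug</A> first.</P></DIV>
</BODY></HTML>'''

def page = new XmlParser().parseText(html)

// Same lookup as above: every DIV whose class attribute mentions entry-content
def entries = page.'**'.DIV.grep { it.'@class'?.contains 'entry-content' }

entries.each { div ->
    println div.text()               // text only, tags dropped
    println XmlUtil.serialize(div)   // markup of the node and its children
}
```

Note that XmlUtil.serialize() puts an XML declaration at the start of its output, so you may want to strip that line before writing the fragment somewhere else.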
Upvotes: 2