Reputation: 4814
I'm trying to run a Clojure regex on a Groovy source file to parse out the individual functions.
// gremlin.groovy

def warm_cache() {
    for (vertex in g.getVertices()) {
        vertex.getOutEdges()
    }
}

def clear() {
    g.clear()
}
This is the pattern I'm using in Clojure:
(def source (read-file "gremlin.groovy"))
(def pattern #"(?m)^def.*[^}]")
(re-seq pattern source)
However, it only grabs the first line, not the whole multi-line function.
Upvotes: 1
Views: 1743
Reputation: 171084
As a demonstration of how you can grab the AST from the GroovyRecognizer, and avoid having to cope with parsing a language using regular expressions, you can do this in Groovy:
import org.codehaus.groovy.antlr.*
import org.codehaus.groovy.antlr.parser.*
def code = '''
// gremlin.groovy

def warm_cache() {
    for (vertex in g.getVertices()) {
        vertex.getOutEdges()
    }
}

def clear() {
    g.clear()
}
'''
def ast = new GroovyRecognizer( new GroovyLexer( new StringReader( code ) ).plumb() ).with { p ->
    p.compilationUnit()
    p.AST
}
while( ast ) {
    println ast.toStringTree()
    ast = ast.nextSibling
}
That prints out the AST for each GroovySourceAST node in the AST, giving you (for this example):
( METHOD_DEF MODIFIERS TYPE warm_cache PARAMETERS ( { ( for ( in vertex ( ( ( . g getVertices ) ELIST ) ) ( { ( EXPR ( ( ( . vertex getOutEdges ) ELIST ) ) ) ) ) )
( METHOD_DEF MODIFIERS TYPE clear PARAMETERS ( { ( EXPR ( ( ( . g clear ) ELIST ) ) ) )
You should be able to do the same thing with Clojure's Java interop and the groovy-all jar file.
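As a rough, untested sketch of that interop route, the Clojure below mirrors the Groovy snippet call for call. It assumes a groovy-all jar that ships the org.codehaus.groovy.antlr classes is on the classpath; parse-ast, top-level-nodes and method-trees are hypothetical helper names, not part of any library.

```clojure
;; Assumption: groovy-all (including the org.codehaus.groovy.antlr parser
;; classes used in the Groovy example) is on the classpath.
(import '(org.codehaus.groovy.antlr.parser GroovyLexer GroovyRecognizer GroovyTokenTypes)
        '(java.io StringReader))

(defn parse-ast
  "Hypothetical helper: parses Groovy source text and returns the first
  top-level AST node (or nil if there is none)."
  [code]
  (let [lexer  (GroovyLexer. (StringReader. code))
        parser (GroovyRecognizer. (.plumb lexer))]
    (.compilationUnit parser)
    (.getAST parser)))

(defn top-level-nodes
  "Hypothetical helper: follows the sibling chain starting at `ast`."
  [ast]
  (take-while some? (iterate #(.getNextSibling %) ast)))

(defn method-trees
  "Hypothetical helper: string form of every top-level METHOD_DEF node."
  [code]
  (for [node  (top-level-nodes (parse-ast code))
        :when (= (.getType node) GroovyTokenTypes/METHOD_DEF)]
    (.toStringTree node)))
```

Calling (method-trees (slurp "gremlin.groovy")) should then give one tree string per function, in the same shape as the Groovy output above.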
To get a bit more info, you just need to drill down into the AST and manipulate the input script a bit. Changing the while loop in the Groovy code above to:
while( ast ) {
    if( ast.type == GroovyTokenTypes.METHOD_DEF ) {
        println """Lines $ast.line to $ast.lineLast
                  | Name: $ast.firstChild.nextSibling.nextSibling.text
                  | Code: ${code.split('\n')[ (ast.line-1)..<ast.lineLast ]*.trim().join( ' ' )}
                  | AST: ${ast.toStringTree()}""".stripMargin()
    }
    ast = ast.nextSibling
}
prints out:
Lines 4 to 8
Name: warm_cache
Code: def warm_cache() { for (vertex in g.getVertices()) { vertex.getOutEdges() } }
AST: ( METHOD_DEF MODIFIERS TYPE warm_cache PARAMETERS ( { ( for ( in vertex ( ( ( . g getVertices ) ELIST ) ) ( { ( EXPR ( ( ( . vertex getOutEdges ) ELIST ) ) ) ) ) )
Lines 10 to 12
Name: clear
Code: def clear() { g.clear() }
AST: ( METHOD_DEF MODIFIERS TYPE clear PARAMETERS ( { ( EXPR ( ( ( . g clear ) ELIST ) ) ) )
Obviously, the Code: section is just the lines joined back together, so it might not work if pasted back into Groovy, but it gives you an idea of the original code.
Upvotes: 6
Reputation: 436
Short answer
(re-seq (Pattern/compile "(?m)^def.*[^}]" Pattern/MULTILINE) source)
From http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
You need to be able to pass in Pattern.MULTILINE when the pattern is compiled. But there is no option for this on re-seq, so you'll probably need to drop down into Java interop to get this to work properly. Ideally, you really should be able to specify this in Clojure land... :(
UPDATE:
Actually, it's not all that bad. Instead of using the literal expression for a regex, just use Java interop for your pattern. Use (re-seq (Pattern/compile "(?m)^def.*[^}]" Pattern/MULTILINE) source) instead (assuming that you've imported java.util.regex.Pattern). I haven't tested this, but I think that will do the trick for you.
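Since the suggestion above is untested, here is a small sketch for checking it at the REPL. It runs the question's pattern once as a regex literal with the embedded (?m) flag and once through Pattern/compile with the explicit flag, then compares the results. The Java documentation lists (?m) as the embedded form of Pattern.MULTILINE, so the comparison shows whether the explicit flag changes anything. The sample text is the question's file with its indentation assumed; literal-matches and interop-matches are names made up for this example.

```clojure
(import 'java.util.regex.Pattern)

;; Sample input copied from the question (indentation is an assumption).
(def source
  (str "// gremlin.groovy\n"
       "def warm_cache() {\n"
       "  for (vertex in g.getVertices()) {\n"
       "    vertex.getOutEdges()\n"
       "  }\n"
       "}\n"
       "def clear() {\n"
       "  g.clear()\n"
       "}\n"))

;; Regex literal with the embedded (?m) flag, exactly as in the question.
(def literal-matches
  (re-seq #"(?m)^def.*[^}]" source))

;; Same expression without (?m), compiled via interop with the explicit flag.
(def interop-matches
  (re-seq (Pattern/compile "^def.*[^}]" Pattern/MULTILINE) source))

;; Compare the two result sequences.
(println (= literal-matches interop-matches))
```

Either way, each match ends right after the first line of a function, because . does not match line terminators unless DOTALL mode, (?s), is also enabled.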
Upvotes: 2
Reputation: 200148
It's your regex, not Clojure. You ask it to match def, then anything, then one character that is not a closing brace. That character can be anywhere. What you want is this: (?sm)def.*?^}. Here (?s) lets . match line terminators, and the lazy .*? stops at the first } that starts a line.
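A minimal sketch of that pattern used with re-seq, assuming the file is indented so that the only closing braces in column 0 are the ones that end a function. The sample text is the question's file with that indentation assumed, and functions is a name chosen for this example.

```clojure
;; Sample input copied from the question (indentation is an assumption).
(def source
  (str "// gremlin.groovy\n"
       "def warm_cache() {\n"
       "  for (vertex in g.getVertices()) {\n"
       "    vertex.getOutEdges()\n"
       "  }\n"
       "}\n"
       "def clear() {\n"
       "  g.clear()\n"
       "}\n"))

;; (?s): . also matches line terminators.
;; (?m): ^ also matches right after a line terminator.
;; .*? : lazy, so each match ends at the first } found at the start of a line.
(def functions
  (re-seq #"(?sm)def.*?^}" source))

;; Print each extracted function, separated by a marker line.
(doseq [f functions]
  (println f)
  (println "----"))
```

With this input there are two matches, one per function. The approach depends on layout: a nested brace in column 0, or the text def appearing outside a function header, would throw the matches off.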
Upvotes: 3