Reputation: 10710
Typing this in Scala (pattern matching with a regexp to find the value of the id field):
val str = """<path sodipodi:nodetypes="csszsscsscssssscssssscc" inkscape:connector-curvature="0" id="basarbre" d="M 111.11111,111.11111 C 101.11111,111.1001 111.11111,111.11111 111.1011,101.01111 111.11111,111.1111 111.11111,110.11111 111.10111,111.11101 110.01111,111.11111 110.11111,111.11101 111.11111,111.01111 110.11111,111.1111 101.11111,111.10111 111.11111,111.11111 111.11111,101.11111 111.11111,111.11111 111.11111,111.11111 111.11111,111.11101 111.11111,101.11111 111.11111,101.11111 111.11111,101.11111 111.111,111.11101 101.01111,110.11111 111.11111,111.11111 101.1111,111.11111 101.11101,110.11111 111.10111,110.11101 101.11111,111.11111 101.11111,111.11111 101.11111,111.11111 111.11111,110.1111 111.10111,111.11111 111.11011,111.11111 111.11101,111.11111 111.01111,111.11111 110.11111,111.11111 111.11111,111.11111 110.01111,111.11111 111.11111,111.11111 111.11111,111.11111 111.01111,101.11111 111.11111,111.11101 110.11011,110.11111 101.11111,111.01111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.1111 10.111111,111.11111 11.111111,101.11111 11.010111,100.11111 11.111111,110.11111 11.111111,110.11111 11.111111,111.11111 11.111111,111.11111 11.010111,111.1111 11.101111,111.01111 11.11011,101.11111 -11.111111,110.11111 11.011111,111.11111 11.111111,111.10101 11.11111,111.11111 111.11101,111.01011 111.11101,111.01011 z" style="fill:#511b00;fill-opacity:1;stroke:none" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"/>"""
val Idpattern = """.*id="([^"]*)"(?:[\n\r\t]|.)*""".r
str match {
case Idpattern(id) => id
case _ => "no id"
}
Yields the following exception trace:
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
at java.util.regex.Pattern$Branch.match(Pattern.java:4502)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
...
How can I overcome this problem? I could try parsing the XML with a library, but I don't need something so obfuscated. I thought a regexp would be fast and reliable.
Upvotes: 3
Views: 360
Reputation: 39587
Here is the correction to the regex, where you are trying to consume line endings. The (?s) turns on DOTALL, so dot matches them too.
scala> val Idpattern = """.*id="([^"]*)"(?s).*""".r
Idpattern: scala.util.matching.Regex = .*id="([^"]*)"(?s).*
scala> str match { case Idpattern(id) => id }
res6: String = basarbre
Here's the better way to find the pattern in Scala:
scala> val Idpattern = """ id="([^"]*)" """.r.unanchored
Idpattern: scala.util.matching.UnanchoredRegex = id="([^"]*)"
scala> str match { case Idpattern(id) => id }
res7: String = basarbre
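The two matches above have no default case, so an input without an id would throw a MatchError. A minimal sketch that keeps the "no id" fallback from the question could look like this. The name idOf is hypothetical, and the pattern is my own variation: it requires whitespace before id and drops the trailing space, so that an id placed last in the tag still matches.

```scala
import scala.util.matching.Regex

// Unanchored: the pattern may match anywhere in the input,
// so no leading or trailing .* is needed.
val IdPattern: Regex = """\sid="([^"]*)"""".r.unanchored

// Hypothetical helper: returns the id value, or "no id" when nothing matches.
def idOf(s: String): String = s match {
  case IdPattern(id) => id
  case _             => "no id"
}
```

With the str from the question, idOf(str) should give basarbre.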
Upvotes: 2
Reputation: 6130
Actually, Scala provides native XML manipulation. So if you remove the """ at the beginning and end of str, it will become a NodeSeq that you can easily manipulate, like:
val str = <path sodipodi:nodetypes="csszsscsscssssscssssscc" inkscape:connector-curvature="0" id="basarbre" d="M 111.11111,111.11111 C 101.11111,111.1001 111.11111,111.11111 111.1011,101.01111 111.11111,111.1111 111.11111,110.11111 111.10111,111.11101 110.01111,111.11111 110.11111,111.11101 111.11111,111.01111 110.11111,111.1111 101.11111,111.10111 111.11111,111.11111 111.11111,101.11111 111.11111,111.11111 111.11111,111.11111 111.11111,111.11101 111.11111,101.11111 111.11111,101.11111 111.11111,101.11111 111.111,111.11101 101.01111,110.11111 111.11111,111.11111 101.1111,111.11111 101.11101,110.11111 111.10111,110.11101 101.11111,111.11111 101.11111,111.11111 101.11111,111.11111 111.11111,110.1111 111.10111,111.11111 111.11011,111.11111 111.11101,111.11111 111.01111,111.11111 110.11111,111.11111 111.11111,111.11111 110.01111,111.11111 111.11111,111.11111 111.11111,111.11111 111.01111,101.11111 111.11111,111.11101 110.11011,110.11111 101.11111,111.01111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.1111 10.111111,111.11111 11.111111,101.11111 11.010111,100.11111 11.111111,110.11111 11.111111,110.11111 11.111111,111.11111 11.111111,111.11111 11.010111,111.1111 11.101111,111.01111 11.11011,101.11111 -11.111111,110.11111 11.011111,111.11111 11.111111,111.10101 11.11111,111.11111 111.11101,111.01011 111.11101,111.01011 z" style="fill:#511b00;fill-opacity:1;stroke:none" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"/>
val idAttribute = str \\ "@id"
val id = if (idAttribute.isEmpty) "no id" else idAttribute.text
You can read more here
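The literal form only applies when the markup is written directly in the source file. If the XML arrives as a String at run time, it has to be parsed first. With scala-xml on the classpath, scala.xml.XML.loadString(str) does that. The sketch below instead uses the DOM parser that ships with the JDK, a swapped-in technique chosen so that it needs no extra dependency. The name idOfXml is hypothetical.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource

// Hypothetical helper: parses the string as XML and reads the id attribute
// of the root element, falling back to "no id" when it is absent.
def idOfXml(xml: String): String = {
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  val doc = builder.parse(new InputSource(new StringReader(xml)))
  // getAttribute returns an empty string when the attribute is missing.
  val id = doc.getDocumentElement.getAttribute("id")
  if (id.isEmpty) "no id" else id
}
```

Unlike the regex approaches, this throws an exception on input that is not well-formed XML, which may or may not be what you want.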
Upvotes: 5
Reputation: 13667
For a task like this, it's better to write a regex that only matches part of the string:
scala> val Idpattern = """id="([^"]*)"""".r
scala> Idpattern.findFirstMatchIn(str).map(_.group(1))
res10: Option[String] = Some(basarbre)
This way, the regex engine can start by scanning through the string for an 'i'. With your original regex, the greedy .* will match the entire string, and then the regex engine will start backtracking from the end. As for why your regex blew the stack, I think this might be a problem with Java's handling of the alternation at the end of the regex, but I'm not really sure. The trace points that way: the same Loop, GroupHead, Branch and GroupTail frames repeat, which looks like one round of recursion for every character the trailing (?:[\n\r\t]|.)* group consumes. The shorter regex gives less opportunity for recursion.
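To compare the two patterns on inputs of different lengths, something like the following sketch may help. The names sample, partialId and originalId are hypothetical, and the filler input is made up for illustration. The original pattern is wrapped in a try so that a stack overflow, if it happens, is reported rather than crashing the session.

```scala
import scala.util.matching.Regex

// Hypothetical helper: one long line with the id attribute near the start,
// followed by n filler characters.
def sample(n: Int): String =
  "<path id=\"basarbre\" d=\"" + ("1" * n) + "\"/>"

// The partial pattern from this answer and the full pattern from the question.
val Partial: Regex = """id="([^"]*)"""".r
val Original: Regex = """.*id="([^"]*)"(?:[\n\r\t]|.)*""".r

def partialId(s: String): Option[String] =
  Partial.findFirstMatchIn(s).map(_.group(1))

// Left is returned only if the JVM runs out of stack while matching.
def originalId(s: String): Either[String, Option[String]] =
  try Right(Original.findFirstMatchIn(s).map(_.group(1)))
  catch { case _: StackOverflowError => Left("stack overflow") }
```

partialId(sample(1000000)) should find the id however long the input is. Whether originalId overflows on the same input depends on the JVM version and the thread stack size, so I would not rely on either outcome.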
Upvotes: 2