Mikaël Mayer
Mikaël Mayer

Reputation: 10710

scala regexp stackoverflow

Typing this in scala (pattern matching with a regexp to find the value of the id field

val str = """<path sodipodi:nodetypes="csszsscsscssssscssssscc" inkscape:connector-curvature="0" id="basarbre" d="M 111.11111,111.11111 C 101.11111,111.1001 111.11111,111.11111 111.1011,101.01111 111.11111,111.1111 111.11111,110.11111 111.10111,111.11101 110.01111,111.11111 110.11111,111.11101 111.11111,111.01111 110.11111,111.1111 101.11111,111.10111 111.11111,111.11111 111.11111,101.11111 111.11111,111.11111 111.11111,111.11111 111.11111,111.11101 111.11111,101.11111 111.11111,101.11111 111.11111,101.11111 111.111,111.11101 101.01111,110.11111 111.11111,111.11111 101.1111,111.11111 101.11101,110.11111 111.10111,110.11101 101.11111,111.11111 101.11111,111.11111 101.11111,111.11111 111.11111,110.1111 111.10111,111.11111 111.11011,111.11111 111.11101,111.11111 111.01111,111.11111 110.11111,111.11111 111.11111,111.11111 110.01111,111.11111 111.11111,111.11111 111.11111,111.11111 111.01111,101.11111 111.11111,111.11101 110.11011,110.11111 101.11111,111.01111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.1111 10.111111,111.11111 11.111111,101.11111 11.010111,100.11111 11.111111,110.11111 11.111111,110.11111 11.111111,111.11111 11.111111,111.11111 11.010111,111.1111 11.101111,111.01111 11.11011,101.11111 -11.111111,110.11111 11.011111,111.11111 11.111111,111.10101 11.11111,111.11111 111.11101,111.01011 111.11101,111.01011 z" style="fill:#511b00;fill-opacity:1;stroke:none" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"/>"""

val Idpattern = """.*id="([^"]*)"(?:[\n\r\t]|.)*""".r

str match {
  case Idpattern(id) => id
  case _ => "no id"
}

Yields the following exception trace:

at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
at java.util.regex.Pattern$Branch.match(Pattern.java:4502)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
...

How can I overcome this problem? I could try parsing xml with a library but I don't need something so obfuscated. I thought regexp could be fast and reliable.

Upvotes: 3

Views: 360

Answers (3)

som-snytt
som-snytt

Reputation: 39587

Here is the correction to the regex, where you are trying to consume line endings. The (?s) turns on DOTALL so dot matches it.

scala> val Idpattern = """.*id="([^"]*)"(?s).*""".r
Idpattern: scala.util.matching.Regex = .*id="([^"]*)"(?s).*

scala> str match { case Idpattern(id) => id }
res6: String = basarbre

Here's the better way to find the pattern in Scala:

scala> val Idpattern = """ id="([^"]*)" """.r.unanchored
Idpattern: scala.util.matching.UnanchoredRegex =  id="([^"]*)" 

scala> str match { case Idpattern(id) => id }
res7: String = basarbre

Upvotes: 2

Chirlo
Chirlo

Reputation: 6130

Actually scala provides native xml manipulation. So if you remove the """ at the beginning and end of str, it will become a NodeSeq that you can easily manipulate, like:

val str = <path sodipodi:nodetypes="csszsscsscssssscssssscc" inkscape:connector-curvature="0" id="basarbre" d="M 111.11111,111.11111 C 101.11111,111.1001 111.11111,111.11111 111.1011,101.01111 111.11111,111.1111 111.11111,110.11111 111.10111,111.11101 110.01111,111.11111 110.11111,111.11101 111.11111,111.01111 110.11111,111.1111 101.11111,111.10111 111.11111,111.11111 111.11111,101.11111 111.11111,111.11111 111.11111,111.11111 111.11111,111.11101 111.11111,101.11111 111.11111,101.11111 111.11111,101.11111 111.111,111.11101 101.01111,110.11111 111.11111,111.11111 101.1111,111.11111 101.11101,110.11111 111.10111,110.11101 101.11111,111.11111 101.11111,111.11111 101.11111,111.11111 111.11111,110.1111 111.10111,111.11111 111.11011,111.11111 111.11101,111.11111 111.01111,111.11111 110.11111,111.11111 111.11111,111.11111 110.01111,111.11111 111.11111,111.11111 111.11111,111.11111 111.01111,101.11111 111.11111,111.11101 110.11011,110.11111 101.11111,111.01111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.1111 10.111111,111.11111 11.111111,101.11111 11.010111,100.11111 11.111111,110.11111 11.111111,110.11111 11.111111,111.11111 11.111111,111.11111 11.010111,111.1111 11.101111,111.01111 11.11011,101.11111 -11.111111,110.11111 11.011111,111.11111 11.111111,111.10101 11.11111,111.11111 111.11101,111.01011 111.11101,111.01011 z" style="fill:#511b00;fill-opacity:1;stroke:none" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"/>

val idAttribute = str \\ "@id"     

val  id = if (idAttribute.isEmpty) "no id" else idAttribute.text

You can read more here

Upvotes: 5

wingedsubmariner
wingedsubmariner

Reputation: 13667

For a task like this, its better to write a regex that only matches part of the string:

scala> val Idpattern = """id="([^"]*)"""".r
scala> Idpattern.findFirstMatchIn(str).map(_.group(1))
res10: Option[String] = Some(basarbre)

This way, the regex engine can start by scanning through the string for an 'i'. With your original regex, the greedy .* will match the entire string, and then the regex engine will start backtracking from the end. As for why your regex blew the stack, I think this might be a problem with Java's handling of the alternation at the end of the regex, but I'm not really sure. The shorter regex gives less opportunity for recursion.

Upvotes: 2

Related Questions