doovers

Reputation: 8675

Swift 2 Regex unexpected behaviour

I'm trying to extract information from an HTML string but I'm getting unexpected results. The code I'm using is as follows:

let html: NSString? = "<tbody><tr><td sortkey=\"20151003\">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr><tr><td sortkey=\"20151004\">04 Oct 2015</td><td>2,577.14</td><td>282.49</td><td>Text</td></tr></tbody>"

let rowPattern = "<tr>\\s*<td s.*?<\\/tr>"
let rowRegex = try! NSRegularExpression(pattern: rowPattern, options: [])
let rowMatches = rowRegex.matchesInString(String(html), options: [], range: NSMakeRange(0, html!.length))

for rowMatch in rowMatches {
    let rowString: NSString = html!.substringWithRange(rowMatch.resultByAdjustingRangesWithOffset(-9).range)

    print(rowString)

    let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
    let valRegex = try! NSRegularExpression(pattern: valPattern, options: [])
    let valMatches = valRegex.matchesInString(String(rowString), options: [], range: NSMakeRange(0, rowString.length))

    for valMatch in valMatches {
        print(valMatch.rangeAtIndex(1))
        // let value = rowString.substringWithRange(valMatch.rangeAtIndex(1))
        // print(value)
    }
}

Output is:

<tr><td sortkey="20151003">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr>
(9223372036854775807,0)
(47,8)
(64,8)
(81,4)

First off, note that I need to offset the range for the rowMatch by -9 to get the correct result. I have no idea why that is the case.

Second, the range returned for the first match is (9223372036854775807,0) which is obviously not correct and throws an error. Again, I don't understand what's going wrong here but I suspect it might be an issue with my regex pattern. The other ranges are correct.

For info, the expected output of print(value) is:

20151003
8,852.61
1,383.68
Text

Edit:

After further experimentation I found the following:

valMatches[0].rangeAtIndex(2) gives the correct range for the first match but valMatches[0].rangeAtIndex(1) is required for the rest. I'm not sure if this is the correct behaviour or if it is a bug as suggested by @t4nhpt in his answer below. Either way, if anyone can explain what's going on it would be good.

Upvotes: 2

Views: 97

Answers (2)

Martin R

Reputation: 539765

The first problem is that let html: NSString? = "..." is an optional, and therefore String(html) evaluates to

Optional(...)

The mysterious offset 9 is the length of "Optional(" :)

To fix that, you can either unwrap String(html!) or declare html as a non-optional. In either case, resultByAdjustingRangesWithOffset(-9) is not necessary.
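
As a minimal sketch of where the 9 comes from: the snippet below uses a short, made-up sample string and current Swift spelling, where `String(describing:)` stands in for the Swift 2 `String(html)` call from the question. Both the sample string and the toolchain are assumptions here, not part of the original code.

```swift
import Foundation

// Hypothetical short sample; the string in the question is much longer.
let html: NSString? = "<tr><td>Text</td></tr>"

// Describing the optional itself wraps the text in "Optional(...)".
// (String(describing:) is assumed here as the modern equivalent of String(html) in Swift 2.)
let wrapped = String(describing: html)

// The extra prefix shifts every match location found in `wrapped`
// relative to the locations in the original NSString.
let optionalPrefix = "Optional("
print(wrapped.hasPrefix(optionalPrefix))  // true
print(optionalPrefix.utf16.count)         // 9
```

Because NSRegularExpression reports ranges in UTF-16 units, the 9 UTF-16 units of the prefix are exactly the offset that had to be subtracted.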


The second problem is that you have two capture groups in your pattern:

let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"

<td>8,852.61</td> matches the first alternative, therefore the first capture group matches 8,852.61, so that rangeAtIndex(1) is set to the range of that string, and rangeAtIndex(2) is set to (NSNotFound, 0).

<td sortkey="20151003">03 Oct 2015</td> matches the second alternative, therefore rangeAtIndex(2) is set to the range of 20151003 and rangeAtIndex(1) is (NSNotFound, 0).

NSNotFound is defined as Int.max and that is 2^63 - 1 = 9223372036854775807 on a 64-bit platform.
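
The behaviour of the two capture groups can be checked in isolation. The following is only a sketch, written with the Swift 3+ method names (`firstMatch(in:options:range:)` and `range(at:)` instead of `rangeAtIndex`), which is an assumption about the toolchain; the single-cell input string is made up for illustration.

```swift
import Foundation

// Same two-alternative pattern as in the question.
let pattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
let regex = try! NSRegularExpression(pattern: pattern, options: [])

// Hypothetical input: a single cell that can only match the first alternative.
let cell = "<td>8,852.61</td>"
let nsCell = NSString(string: cell)
let match = regex.firstMatch(in: cell, options: [],
                             range: NSRange(location: 0, length: nsCell.length))!

let group1 = match.range(at: 1)  // group of the first alternative
let group2 = match.range(at: 2)  // group of the second alternative, not used by this match

print(nsCell.substring(with: group1))  // 8,852.61
print(group2.location == NSNotFound)   // true
print(NSNotFound == Int.max)           // true
```

A group that did not take part in the match is reported with location NSNotFound and length 0, so its range must be checked before it is passed to a substring call.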


Putting it all together, this gives the expected results:

let html: NSString = "<tbody><tr><td sortkey=\"20151003\">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr><tr><td sortkey=\"20151004\">04 Oct 2015</td><td>2,577.14</td><td>282.49</td><td>Text</td></tr></tbody>"

let rowPattern = "<tr>\\s*<td s.*?<\\/tr>"
let rowRegex = try! NSRegularExpression(pattern: rowPattern, options: [])
let rowMatches = rowRegex.matchesInString(String(html), options: [], range: NSMakeRange(0, html.length))

for rowMatch in rowMatches {
    let rowString: NSString = html.substringWithRange(rowMatch.range)

    print("rowString=\(rowString)")

    let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
    let valRegex = try! NSRegularExpression(pattern: valPattern, options: [])
    let valMatches = valRegex.matchesInString(String(rowString), options: [], range: NSMakeRange(0, rowString.length))

    for valMatch in valMatches {
        if valMatch.rangeAtIndex(1).location != NSNotFound {
            let value = rowString.substringWithRange(valMatch.rangeAtIndex(1))
            print(value)
        }
        if valMatch.rangeAtIndex(2).location != NSNotFound {
            let value = rowString.substringWithRange(valMatch.rangeAtIndex(2))
            print(value)
        }
    }
}

Output:

rowString=<tr><td sortkey="20151003">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr>
20151003
8,852.61
1,383.68
Text
rowString=<tr><td sortkey="20151004">04 Oct 2015</td><td>2,577.14</td><td>282.49</td><td>Text</td></tr>
20151004
2,577.14
282.49
Text
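
If the two `if` blocks feel repetitive, one possible variation is to pick whichever capture group actually took part in the match. This is a sketch only: `firstCapture(of:in:)` is a hypothetical helper, not a Foundation API, and the code uses the Swift 3+ method names rather than the Swift 2 ones above.

```swift
import Foundation

// Hypothetical helper (not part of Foundation): returns the text of the first
// capture group that participated in the match, or nil if none did.
func firstCapture(of match: NSTextCheckingResult, in text: NSString) -> String? {
    for index in 1..<match.numberOfRanges {
        let range = match.range(at: index)
        if range.location != NSNotFound {
            return text.substring(with: range)
        }
    }
    return nil
}

// First row from the question, already extracted.
let row = "<tr><td sortkey=\"20151003\">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr>"
let nsRow = NSString(string: row)

let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
let valRegex = try! NSRegularExpression(pattern: valPattern, options: [])

// Collect one value per match, in document order.
let values = valRegex
    .matches(in: row, options: [], range: NSRange(location: 0, length: nsRow.length))
    .compactMap { firstCapture(of: $0, in: nsRow) }

print(values)  // ["20151003", "8,852.61", "1,383.68", "Text"]
```

The helper loops over `numberOfRanges`, so it keeps working if further alternatives with their own capture groups are added to the pattern.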

Upvotes: 2

t4nhpt

Reputation: 5302

It seems to be a bug when the two patterns are joined. As a workaround, you can split your pattern into two parts, collect the two [NSTextCheckingResult] arrays, and then concatenate them. It's a cheat, haha.

  // let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
  let valPattern1 = "<td.*?\"(.*?)\">.*?<\\/td>"
  let valPattern2 = "<td>(.*?)<\\/td>"
  var valRegex = try! NSRegularExpression(pattern: valPattern1, options: [])
  var valMatches1 = valRegex.matchesInString(String(rowString), options: [], range: NSMakeRange(0, rowString.length))
  valRegex = try! NSRegularExpression(pattern: valPattern2, options: [])
  let valMatches2 = valRegex.matchesInString(String(rowString), options: [], range: NSMakeRange(0, rowString.length))

  valMatches1 += valMatches2

  for valMatch in valMatches1 {
       ...

Upvotes: 1
