Reputation: 8675
I'm trying to extract information from an HTML string, but I'm getting unexpected results. The code I'm using is as follows:
let html: NSString? = "<tbody><tr><td sortkey=\"20151003\">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr><tr><td sortkey=\"20151004\">04 Oct 2015</td><td>2,577.14</td><td>282.49</td><td>Text</td></tr></tbody>"
let rowPattern = "<tr>\\s*<td s.*?<\\/tr>"
let rowRegex = try! NSRegularExpression(pattern: rowPattern, options: [])
let rowMatches = rowRegex.matchesInString(String(html), options: [], range: NSMakeRange(0, html!.length))
for rowMatch in rowMatches {
    let rowString: NSString = html!.substringWithRange(rowMatch.resultByAdjustingRangesWithOffset(-9).range)
    print(rowString)
    let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
    let valRegex = try! NSRegularExpression(pattern: valPattern, options: [])
    let valMatches = valRegex.matchesInString(String(rowString), options: [], range: NSMakeRange(0, rowString.length))
    for valMatch in valMatches {
        print(valMatch.rangeAtIndex(1))
        // let value = rowString.substringWithRange(valMatch.rangeAtIndex(1))
        // print(value)
    }
}
Output is:
<tr><td sortkey="20151003">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr>
(9223372036854775807,0)
(47,8)
(64,8)
(81,4)
First off, note that I need to offset the range for the rowMatch by -9 to get the correct result. I have no idea why that is the case.
Second, the range returned for the first match is (9223372036854775807,0) which is obviously not correct and throws an error. Again, I don't understand what's going wrong here but I suspect it might be an issue with my regex pattern. The other ranges are correct.
For reference, the expected output of print(value) is:
20151003
8,852.61
1,383.68
Text
Edit:
After further experimentation I found the following:
valMatches[0].rangeAtIndex(2) gives the correct range for the first match, but rangeAtIndex(1) is required for the remaining matches. I'm not sure whether this is the correct behaviour or a bug, as suggested by @t4nhpt in his answer below. Either way, I'd appreciate it if anyone could explain what's going on.
Upvotes: 2
Views: 97
Reputation: 539765
The first problem is that html, declared as let html: NSString? = "...", is an optional, and therefore String(html) evaluates to Optional(...). The regex then searches a string in which the HTML starts 9 characters later, so every match range is shifted by that amount. The mysterious offset 9 is the length of "Optional(" :)
To fix that, you can either unwrap it with String(html!) or declare html as a non-optional. In either case, resultByAdjustingRangesWithOffset(-9) is not necessary.
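The shift can be reproduced in isolation. This is a minimal sketch, assuming current Swift, where `String(describing:)` is the spelling for what `String(html)` did with an optional in the Swift 2 code above; the names `text`, `described`, and `offset` are illustrative and not from the original post.

```swift
import Foundation

let text = "<tr><td>1</td></tr>"
let html: NSString? = NSString(string: text)

// Describing the Optional itself wraps the contents in "Optional(...)"
// instead of returning the plain text.
let described = String(describing: html)
print(described)

// Find where the real HTML starts inside the described string (in UTF-16 units).
// Every NSRange computed on `described` is shifted by this amount relative to `text`.
var offset = NSNotFound
if let found = described.range(of: text) {
    offset = NSRange(found, in: described).location
}
print(offset)
```

Searching the unwrapped string, or declaring the variable as non-optional in the first place, removes the shift, so no range adjustment is needed.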
The second problem is that you have two capture groups in your pattern:
let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
<td>8,852.61</td> matches the first alternative, therefore the first capture group matches 8,852.61, so that rangeAtIndex(1) is set to the range of that string, and rangeAtIndex(2) is set to (NSNotFound, 0).
<td sortkey="20151003">03 Oct 2015</td> matches the second alternative, therefore rangeAtIndex(2) is set to the range of 20151003 and rangeAtIndex(1) is (NSNotFound, 0).
NSNotFound is defined as Int.max, and that is 2^63 - 1 = 9223372036854775807 on a 64-bit platform.
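To see which capture group takes part in each match, here is a small sketch. It assumes current Swift naming (`matches(in:options:range:)` and `range(at:)` instead of the Swift 2 `matchesInString` and `rangeAtIndex`), and it uses a shortened row; the variable `participating` is illustrative.

```swift
import Foundation

let row = "<tr><td sortkey=\"20151003\">03 Oct 2015</td><td>8,852.61</td></tr>"
let pattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
let regex = try! NSRegularExpression(pattern: pattern, options: [])
let fullRange = NSRange(row.startIndex..<row.endIndex, in: row)

// For each match, record the index of the capture group that actually took part.
// The group belonging to the other alternative reports location == NSNotFound.
var participating: [Int] = []
for match in regex.matches(in: row, options: [], range: fullRange) {
    for group in 1..<match.numberOfRanges
    where match.range(at: group).location != NSNotFound {
        participating.append(group)
    }
}
print(participating)            // [2, 1]
print(NSNotFound == Int.max)    // true
```

The sortkey cell is matched by the second alternative (group 2) and the plain cell by the first (group 1), which is why a single fixed group index cannot work for every match.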
Putting it all together, this gives the expected results:
let html: NSString = "<tbody><tr><td sortkey=\"20151003\">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr><tr><td sortkey=\"20151004\">04 Oct 2015</td><td>2,577.14</td><td>282.49</td><td>Text</td></tr></tbody>"
let rowPattern = "<tr>\\s*<td s.*?<\\/tr>"
let rowRegex = try! NSRegularExpression(pattern: rowPattern, options: [])
let rowMatches = rowRegex.matchesInString(String(html), options: [], range: NSMakeRange(0, html.length))
for rowMatch in rowMatches {
    let rowString: NSString = html.substringWithRange(rowMatch.range)
    print("rowString=\(rowString)")
    let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
    let valRegex = try! NSRegularExpression(pattern: valPattern, options: [])
    let valMatches = valRegex.matchesInString(String(rowString), options: [], range: NSMakeRange(0, rowString.length))
    for valMatch in valMatches {
        if valMatch.rangeAtIndex(1).location != NSNotFound {
            let value = rowString.substringWithRange(valMatch.rangeAtIndex(1))
            print(value)
        }
        if valMatch.rangeAtIndex(2).location != NSNotFound {
            let value = rowString.substringWithRange(valMatch.rangeAtIndex(2))
            print(value)
        }
    }
}
Output:
rowString=<tr><td sortkey="20151003">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr>
20151003
8,852.61
1,383.68
Text
rowString=<tr><td sortkey="20151004">04 Oct 2015</td><td>2,577.14</td><td>282.49</td><td>Text</td></tr>
20151004
2,577.14
282.49
Text
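For readers on a current toolchain, the same two-level extraction might look roughly like the sketch below. It is not the code from the answer: it assumes the modern Foundation spellings, works on a Swift String with NSRange conversions instead of NSString, and introduces two hypothetical helpers, `firstCapturedGroup(of:in:)` and `extractRows(from:)`.

```swift
import Foundation

// Hypothetical helper: text of the first capture group that took part
// in the match, or nil if no group did.
func firstCapturedGroup(of match: NSTextCheckingResult, in text: String) -> String? {
    for group in 1..<match.numberOfRanges {
        let nsRange = match.range(at: group)
        if nsRange.location != NSNotFound, let range = Range(nsRange, in: text) {
            return String(text[range])
        }
    }
    return nil
}

// Hypothetical helper: one array of cell values per table row.
func extractRows(from html: String) -> [[String]] {
    let rowRegex = try! NSRegularExpression(pattern: "<tr>\\s*<td s.*?<\\/tr>", options: [])
    let valRegex = try! NSRegularExpression(
        pattern: "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>", options: [])
    var rows: [[String]] = []
    let htmlRange = NSRange(html.startIndex..<html.endIndex, in: html)
    for rowMatch in rowRegex.matches(in: html, options: [], range: htmlRange) {
        // Convert the NSRange back to a String range before slicing.
        guard let rowRange = Range(rowMatch.range, in: html) else { continue }
        let row = String(html[rowRange])
        let rowNSRange = NSRange(row.startIndex..<row.endIndex, in: row)
        let values = valRegex.matches(in: row, options: [], range: rowNSRange)
            .compactMap { firstCapturedGroup(of: $0, in: row) }
        rows.append(values)
    }
    return rows
}

let html = "<tbody><tr><td sortkey=\"20151003\">03 Oct 2015</td><td>8,852.61</td><td>1,383.68</td><td>Text</td></tr><tr><td sortkey=\"20151004\">04 Oct 2015</td><td>2,577.14</td><td>282.49</td><td>Text</td></tr></tbody>"
let rows = extractRows(from: html)
for row in rows {
    print(row)
}
```

Because the input is never an optional here, no range adjustment is needed, and checking each group against NSNotFound replaces the two separate if blocks.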
Upvotes: 2
Reputation: 5302
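This looks like a bug when two patterns are joined with |. As a workaround, you can split your pattern into two parts, run each one to get two [NSTextCheckingResult] arrays, and then concatenate them. Note that the combined array lists all matches of the first pattern before those of the second, rather than in document order. It's a cheat, haha.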
// let valPattern = "<td>(.*?)<\\/td>|<td.*?\"(.*?)\">.*?<\\/td>"
let valPattern1 = "<td.*?\"(.*?)\">.*?<\\/td>"
let valPattern2 = "<td>(.*?)<\\/td>"
var valRegex = try! NSRegularExpression(pattern: valPattern1, options: [])
var valMatches1 = valRegex.matchesInString(String(rowString), options: [], range: NSMakeRange(0, rowString.length))
valRegex = try! NSRegularExpression(pattern: valPattern2, options: [])
let valMatches2 = valRegex.matchesInString(String(rowString), options: [], range: NSMakeRange(0, rowString.length))
valMatches1 += valMatches2
for valMatch in valMatches1 {
...
Upvotes: 1