Reputation: 2887
I am trying to parse an HTML page that contains these values:
<a href="somesite.html?id=123">...</a>
<a href="somesite.html?id=456">...</a>
<a href="somesite.html?id=789">...</a>
<a href="anothersite.html">...</a>
How would I parse the HTML string to get back an array that contains only the somesite.html links?
["somesite.html?id=123", "somesite.html?id=456", "somesite.html?id=789"]
Edited
Using Zhiguo Wang's base answer, I can't seem to get only the somesite.html id values... The 3rd item in the array contains excess characters:
let htmlString = "<a href=\"somesite.html?id=123\">...</a>" +
"<a href=\"somesite.html?id=456\">...</a>" +
"<a href=\"somesite.html?id=789\">...</a>" +
"<a href=\"anothersite.html\">...</a>\""
let seperateComponent = "<a href=\"somesite.html?id="
let linkExp = "[\\w\\W]*\">"
Returns this value:
["123", "456", "789\\">...</a><a href=\\"anothersite.html"]
Expected Value: ["123", "456", "789"]
Hmm. Changing linkExp to the following resolves it. What does \W represent in a regex?
let linkExp = "[\\w]*\">"
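For reference, \w matches a word character (letters, digits, underscore) and \W matches anything that is not one, so [\w\W] matches any character at all and the greedy * runs on to the last `">` in the chunk. Here is a minimal sketch of the difference in current Swift (not the Xcode 6.4 dialect used elsewhere in this thread); `firstRegexMatch` is a hypothetical helper written only for this illustration.

```swift
import Foundation

// Hypothetical helper: returns the first match of `pattern` in `text`, or nil.
func firstRegexMatch(of pattern: String, in text: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: whole),
          let range = Range(match.range, in: text) else { return nil }
    return String(text[range])
}

// The last chunk produced by splitting on `<a href="somesite.html?id=`.
let chunk = "789\">...</a><a href=\"anothersite.html\">...</a>"

// [\w\W] means "word or non-word", i.e. any character,
// so the greedy * extends to the LAST `">` in the chunk.
let greedy = firstRegexMatch(of: "[\\w\\W]*\">", in: chunk)

// [\w] stops at the first non-word character,
// so only the digits can precede `">`.
let wordOnly = firstRegexMatch(of: "[\\w]*\">", in: chunk)
```

Stripping the trailing two characters from each match gives the two results seen above: the over-long third item for the first pattern and "789" for the second.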
The length was also wrong. Casting to NSString gave the proper length.
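A likely cause of the wrong length (an assumption, since the failing input is not shown): NSRange is measured in UTF-16 code units, which is what NSString's length reports, while the UTF-8 byte count only agrees with it for pure ASCII text. A small sketch in current Swift, with a made-up sample string:

```swift
import Foundation

// Made-up sample: "café" followed by `">`, using a precomposed é (U+00E9).
let text = "caf\u{E9}\">"

// UTF-8 byte count: é takes two bytes, so this overshoots.
let utf8Bytes = text.utf8.count

// UTF-16 code unit count: the unit NSRange and NSString.length use.
let utf16Units = text.utf16.count

// Building the range from String indices avoids counting by hand.
let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
```

Passing the byte count as the range length asks the regex to scan past the end of the string whenever the text contains non-ASCII characters.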
Edited 2
It looks like if the following tag appears before the somesite links, then "origin" is included in the array:
<meta name=\"referrer\" content=\"origin\">
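The meta tag leaks in because the text before the first separator is also a chunk, and `[\w]*">` happily matches `origin">` inside it. One way to avoid this, sketched in current Swift under the assumption that every link has exactly the `<a href="somesite.html?id=...">` layout shown in the question, is to drop the split step and let a single regex with a capture group do the filtering. `extractIDs` is a hypothetical name, not something from the answers in this thread.

```swift
import Foundation

// Hypothetical helper: collects the id of every somesite.html link.
// Assumes the attribute is written exactly as `<a href="somesite.html?id=...">`.
func extractIDs(from html: String) -> [String] {
    let pattern = "<a href=\"somesite\\.html\\?id=(\\w+)\""
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return []
    }
    let whole = NSRange(html.startIndex..<html.endIndex, in: html)
    return regex.matches(in: html, options: [], range: whole).compactMap { match -> String? in
        // Capture group 1 holds the id value.
        Range(match.range(at: 1), in: html).map { String(html[$0]) }
    }
}

let html = "<meta name=\"referrer\" content=\"origin\">" +
    "<a href=\"somesite.html?id=123\">...</a>" +
    "<a href=\"somesite.html?id=456\">...</a>" +
    "<a href=\"somesite.html?id=789\">...</a>" +
    "<a href=\"anothersite.html\">...</a>"

let ids = extractIDs(from: html)
```

To get the full link values instead of the ids, the parentheses can be moved so they wrap `somesite\.html\?id=\w+`. A regex like this is fragile against real-world HTML (single quotes, extra attributes, different ordering), so treat it as a sketch for this exact input.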
Upvotes: 0
Views: 428
Reputation: 26
Here's the improved code:
let htmlString = "<a href=\"somesite.html?id=123\">...</a>" +
"<a href=\"somesite.html?id=456\">...</a>" +
"<a href=\"somesite.html?id=789\">...</a>" +
"<a href=\"anothersite.html\">...</a>\""
let seperateComponent = "<a href=\""
let linkExp = "[\\w\\W]*\">"
// Compile the pattern once; it matches everything up to and including `">`.
let linkRegExp = NSRegularExpression(pattern: linkExp, options: NSRegularExpressionOptions.CaseInsensitive, error: nil)
let seperatedArray = htmlString.componentsSeparatedByString(seperateComponent)
var resultArray = [String]()
if seperatedArray.count > 1 {
    for seperatedString in seperatedArray {
        // NSRange is measured in UTF-16 units, so use NSString.length rather than the UTF-8 byte count.
        let chunk = seperatedString as NSString
        if chunk.length > 3 {
            let myRange = linkRegExp!.rangeOfFirstMatchInString(seperatedString, options: NSMatchingOptions.ReportCompletion, range: NSMakeRange(0, chunk.length))
            if myRange.location != NSNotFound {
                let matchString = chunk.substringWithRange(myRange)
                let linkWished = "somesite.html?id="
                // Keep only links that point at somesite.html.
                if matchString.componentsSeparatedByString(linkWished).count > 1 {
                    // Drop the "somesite.html?id=" prefix, then the trailing `">`.
                    var linkString = (matchString as NSString).substringFromIndex((linkWished as NSString).length)
                    linkString = (linkString as NSString).substringToIndex((linkString as NSString).length - 2)
                    resultArray.append(linkString)
                }
            }
        }
    }
}
println(resultArray)
Upvotes: 0
Reputation: 26
talk is cheap, show me the code
let htmlString = "<a href=\"somesite.html?id=123\">...</a><a href=\"somesite.html?id=456\">...</a><a href=\"somesite.html?id=789\">...</a>"
let seperateComponent = "<a href=\""
let linkExp = "[\\w\\W]*\">"
// Compile the pattern once; it matches everything up to and including `">`.
let linkRegExp = NSRegularExpression(pattern: linkExp, options: NSRegularExpressionOptions.CaseInsensitive, error: nil)
let seperatedArray = htmlString.componentsSeparatedByString(seperateComponent)
var resultArray = [String]()
if seperatedArray.count > 1 {
    for seperatedString in seperatedArray {
        // NSRange is measured in UTF-16 units, so use NSString.length rather than the UTF-8 byte count.
        let chunk = seperatedString as NSString
        if chunk.length > 3 {
            let myRange = linkRegExp!.rangeOfFirstMatchInString(seperatedString, options: NSMatchingOptions.ReportCompletion, range: NSMakeRange(0, chunk.length))
            if myRange.location != NSNotFound {
                let matchString = chunk.substringWithRange(myRange)
                // Drop the trailing `">` to leave just the href value.
                let linkString = (matchString as NSString).substringToIndex((matchString as NSString).length - 2)
                resultArray.append(linkString)
            }
        }
    }
}
println(resultArray)
This code was run on Xcode 6.4 and produces the correct result. Sorry, I need at least 10 reputation to post images, so I can't include a screenshot of the output here.
Upvotes: 1