Reputation: 2289
My regular expression (regex) skills are still a work in progress, and I'm having trouble extracting anchor text from a string stored in a hash.
My hash looks like:
hash["example"]
=> " <a href=\"../Project.html\">Project</a>, <a href=\"../area1.html\">Area 1</a>"
My Ruby code, which tries to extract "Project" and "Area 1":
hash["ITA Area"].scan(/<a href=\"(.*)\">(.*)<\/a>/)
Any help would be much appreciated as always.
Upvotes: 0
Views: 64
Reputation: 146053
The canonical SO reason to use a real HTML parser is calmly explained right here.
However, regexen can parse simple snippets without too much trouble.
Update: Aha, the anchor text. That's actually pretty easy:
> s.scan /([^<>]*)<\/a>/
=> [["Project"], ["Area 1"]]
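For anyone who wants to run this outside irb, here is a minimal self-contained sketch. It assumes `s` holds the string from the question; the variable names `texts` and `pairs` are only illustrative. The second pattern applies the same negated-character-class idea to the href as well, and it is only meant for simple, predictable snippets like this one.

```ruby
# The string from the question (assumed to be what `s` refers to above).
s = ' <a href="../Project.html">Project</a>, <a href="../area1.html">Area 1</a>'

# [^<>]* cannot cross a tag boundary, so each match stops at the text
# immediately preceding a closing </a>.
texts = s.scan(/([^<>]*)<\/a>/).flatten

# Same idea for the attribute: [^"]* cannot run past the closing quote.
pairs = s.scan(/<a href="([^"]*)">([^<>]*)<\/a>/)

p texts
p pairs
```

Because the character classes exclude the delimiters, no non-greedy quantifier is needed here.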
Upvotes: 0
Reputation: 2829
I'm not entirely sure what your issue is, but the regexp should match. Double quotes " need not be escaped. As mentioned in Dan Breen's answer, you need to use non-greedy matchers if the string is expected to contain more than one possible match.
Upvotes: 0
Reputation: 6882
You will have to escape the backslashes themselves, so something like \\\\ instead of just \\. It sounds stupid, but I had a similar problem with it.
Upvotes: 0
Reputation: 12924
Your groups are using greedy matching, so each one grabs as much as it can rather than stopping at, say, the first < for the second group. Change the (.*) parts to (.*?) to use non-greedy (lazy) matching.
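As a rough illustration of the difference, the sketch below runs both versions of the pattern against the string from the question. The variable names (`html`, `greedy`, `lazy`, `anchor_texts`) are made up for the example.

```ruby
# The string from the question.
html = ' <a href="../Project.html">Project</a>, <a href="../area1.html">Area 1</a>'

# Greedy: the first (.*) runs to the LAST "> in the string, so the whole
# line collapses into a single match.
greedy = html.scan(/<a href="(.*)">(.*)<\/a>/)

# Non-greedy: (.*?) stops at the first place the rest of the pattern can
# match, giving one [href, text] pair per link.
lazy = html.scan(/<a href="(.*?)">(.*?)<\/a>/)

# Keep only the anchor text from each pair.
anchor_texts = lazy.map { |_href, text| text }

p greedy.length
p lazy
p anchor_texts
```

Note that the double quotes are left unescaped inside the regex literal, which works the same as writing `\"`.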
There are loads of posts here on why you should not use regex to parse HTML. There are many reasons, such as: what if there is more than one space between the a and href? It would be ideal to use a tool designed for parsing HTML.
Upvotes: 2