Tom

Reputation: 34366

Find spaces in anchor links

We've got a large amount of static HTML that has links like, e.g.:

<a href="link.html#glossary">Link</a>

However, some of them contain spaces in the anchor, e.g.:

 <a href="link.html#this is the glossary">Link</a>

Any ideas on what kind of regular expression I'd need to find the spaces after the # and replace them with a - or _?

Update: I just need to find them using TextMate, hence no need for an HTML parsing lib.

Upvotes: 1

Views: 914

Answers (3)

Sebastian Hoitz

Reputation: 9383

Here, this regex matches the hash and all the words and spaces in between:

#(\w+\s)+\w+

http://dl.getdropbox.com/u/5912/Jing/2009-08-12_1651.png
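To see what this pattern picks up, here is a rough sketch that runs it against the two sample links from the question. It uses Python's `re` module rather than TextMate's find dialog, so treat it as an approximation; the helper name `find_spaced_anchor` is made up for this illustration.

```python
import re

# The pattern from this answer: a hash, one or more "word + whitespace"
# groups, then a final word.
pattern = re.compile(r"#(\w+\s)+\w+")

# The two sample links from the question.
samples = [
    '<a href="link.html#glossary">Link</a>',
    '<a href="link.html#this is the glossary">Link</a>',
]


def find_spaced_anchor(line):
    """Return the matched anchor text, or None if the anchor has no spaces."""
    match = pattern.search(line)
    return match.group(0) if match else None


for line in samples:
    print(find_spaced_anchor(line))
```

Only the second link matches, because the pattern requires at least one word followed by whitespace after the hash.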

When you have some time, you should download "The Regex Coach", which is an awesome tool to develop your own regexes. You get instant feedback and you learn very fast. Plus it comes at no cost!

Visit the homepage

Upvotes: 1

Mark

Reputation: 2071

This regex should do it:

#[a-zA-Z]+\s+[a-zA-Z\s]+

Three caveats:

First, if you are afraid that the page text itself (and not just the links) might contain information like "#hashtag more words", then you could make the regex more restrictive, like this:

#[a-zA-Z]+\s+[a-zA-Z\s]+\">

Second, if you have hash tags that contain characters beyond A-Z, then just add them in between the second set of brackets. So, if you have '-' as well, you would modify to:

#[a-zA-Z]+\s+[a-zA-Z\s-]+\">

Finally, this assumes that all the links you are trying to match start with a letter/word and are followed by a space, so, in the current form, it would not match "Anchor-tags-galore", but would match "Anchor tags galore."
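If you later want to do the replacement outside the editor, something along these lines might work. This is only a sketch under a few assumptions: the `href` values are double-quoted, each contains at most one `#`, and the name `fix_anchor_spaces` and the pattern are invented for this example rather than taken from the answer above.

```python
import re

# Assumed shape: href="<anything without " or #>#<fragment containing whitespace>"
ANCHOR_WITH_SPACES = re.compile(r'(href="[^"#]*#)([^"]*\s[^"]*)(")')


def fix_anchor_spaces(html, replacement="-"):
    """Replace runs of whitespace inside href fragments with `replacement`."""

    def _fix(match):
        prefix, fragment, quote = match.groups()
        # Trim stray leading/trailing whitespace, then collapse each
        # remaining run of whitespace into one replacement character.
        cleaned = re.sub(r"\s+", replacement, fragment.strip())
        return prefix + cleaned + quote

    return ANCHOR_WITH_SPACES.sub(_fix, html)


print(fix_anchor_spaces('<a href="link.html#this is the glossary">Link</a>'))
```

Because the pattern is tied to `href="..."`, text such as "#hashtag more words" in the page body is left alone. Pass `replacement="_"` if you prefer underscores.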

Upvotes: 2

user122299

Reputation:

Have you considered using an HTML parsing library like BeautifulSoup? It would make finding all the hrefs much easier!
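As a rough idea of what the parser route looks like, here is a sketch that lists every `href` whose anchor contains whitespace. Note that it swaps in Python's standard-library `html.parser` instead of BeautifulSoup so it runs without extra installs; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class SpacedAnchorFinder(HTMLParser):
    """Collect href values of <a> tags whose #fragment contains whitespace."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and "#" in value:
                # Everything after the first '#' is the anchor (fragment).
                fragment = value.split("#", 1)[1]
                if any(ch.isspace() for ch in fragment):
                    self.found.append(value)


def find_spaced_anchors(html):
    finder = SpacedAnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.found


page = (
    '<a href="link.html#glossary">Link</a> '
    '<a href="link.html#this is the glossary">Link</a>'
)
print(find_spaced_anchors(page))
```

Since only real `href` attributes are inspected, hash-like text in the page body cannot produce false positives.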

Upvotes: 2
