Andrew Larned

Reputation: 933

Replace strings using regular expressions and backreferences in Clojure

I'm trying to convert from HTML to LaTeX, and want to change this:

<a href="www.foo.com/bar">baz</a> 

into:

baz\footnote{www.foo.com/bar}

I'd like to write a Clojure function that takes a chunk of text and replaces every match in a given paragraph.

I've tried

(.replaceAll 
    "<a href=\"foo.com\">baz</a>" 
    "<a.*href=\"(.*)\">(.*)</a>" 
    "\2\\footnote{\1}")

but that returns:

"^Bfootnote{^A}"

I've also looked at clojure.contrib.str-utils2, which has a replace function that uses regular expressions, but it doesn't seem to handle backreferences. Am I missing something? Going about this the wrong way? Any help is appreciated.

Upvotes: 2

Views: 1582

Answers (2)

kotarak

Reputation: 17299

And if you want to be really spiffy, you go for clojure.xml. It will return a tree of structures you can modify as you like. Your above example would look like this:

{:tag :a :attrs {:href "www.foo.com/bar"} :content ["baz"]}

This can be easily translated to something like:

["baz" {:footnote "www.foo.com/bar"}]

which can be easily serialised back to your desired form. And the best part: No unmaintainable regexes. :) YMMV of course.....
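
As a rough sketch of that idea, assuming the input is well-formed XML (real-world HTML often is not, and clojure.xml will refuse to parse it), the tree can be walked and rendered directly. The names parse-str and node->latex are made up for this illustration; only clojure.xml/parse comes from the library.

```clojure
(require '[clojure.xml :as xml])

(defn parse-str
  "Hypothetical helper: parse a well-formed XML string with clojure.xml."
  [^String s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn node->latex
  "Hypothetical helper: render a parsed node (or a plain string) as LaTeX text.
  Only :a gets special treatment; every other tag just contributes its content."
  [node]
  (if (string? node)
    node
    (let [body (apply str (map node->latex (:content node)))]
      (if (= :a (:tag node))
        (str body "\\footnote{" (get-in node [:attrs :href]) "}")
        body))))

(node->latex {:tag :a :attrs {:href "www.foo.com/bar"} :content ["baz"]})
;; => "baz\\footnote{www.foo.com/bar}"
```

For a whole paragraph, something like (node->latex (parse-str "<p>...</p>")) would walk every nested link in one pass, which is the main attraction over a regex.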

Upvotes: 1

Brian Carper

Reputation: 72926

(You should not parse HTML with a regex...)

Two things:

  1. Java uses $1, $2 to refer to capture groups, not \1, \2.

  2. You need more backslashes in the replacement text. The first level of backslashing is consumed by the Clojure reader, because the replacement is a string literal. The second level is consumed by the regex engine, which treats a backslash in a replacement string as an escape character. Unfortunately Clojure doesn't have a general syntax for "raw" String literals (yet?). The Clojure literal regex syntax #"" does some magic to save you some backslashes, but normal Strings don't have that magic.

So:

user> (.replaceAll "<a href=\"www.foo.com/bar\">baz</a>"
                   "<a.*href=\"(.*)\">(.*)</a>"
                   "$2\\\\footnote{$1}")
"baz\\footnote{www.foo.com/bar}"

You can also do it this way:

user> (require '(clojure.contrib [str-utils2 :as s]))
nil
user> (s/replace "<a href=\"www.foo.com/bar\">baz</a>"
                 #"<a.*href=\"(.*)\">(.*)</a>"
                 (fn [[_ url txt]]
                     (str txt "\\\\footnote{" url "}")))
"baz\\footnote{www.foo.com/bar}"
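
One caveat for the "as many matches as exist" part of the question: the greedy .* in the pattern above swallows everything between the first <a and the last </a>, so a paragraph with two links collapses into a single footnote. A minimal sketch of a variant for that case follows, using clojure.string (which replaced clojure.contrib.str-utils2 as of Clojure 1.2) and a tighter pattern. The function name links->footnotes and the second URL are invented for the example, and the pattern is still only an approximation of real HTML.

```clojure
(require '[clojure.string :as str])

(defn links->footnotes
  "Hypothetical helper: replace every <a href=\"...\">text</a> in s
  with text\\footnote{url}."
  [s]
  (str/replace s
               ;; [^\"]* and .*? stop at the first closing quote and the
               ;; first </a>, so each link is matched separately
               #"<a\s[^>]*?href=\"([^\"]*)\"[^>]*>(.*?)</a>"
               ;; string replacement: $1/$2 are capture groups,
               ;; and \\\\ ends up as a single backslash in the output
               "$2\\\\footnote{$1}"))

(links->footnotes
 "See <a href=\"www.foo.com/bar\">baz</a> and <a href=\"x.org\">qux</a>.")
;; => "See baz\\footnote{www.foo.com/bar} and qux\\footnote{x.org}."
```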

In a Clojure string literal, "\2" is an octal escape for a control character (ASCII character 2), which is why it's displayed as ^B. The string holds the same character that (char 2) returns.
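
A quick check of this at the REPL, as a sketch. It also shows where the backslash went in the original attempt: after the reader has processed the literal, the regex engine sees \f in the replacement and treats it as an escaped literal f. The name attempt is just a label for this example.

```clojure
;; The reader turns "\2" into a one-character string holding character code 2.
(= "\2" (str (char 2)))
;; => true

;; So the original replacement never contained a group reference:
;; it is char 2, a backslash, "footnote{", char 1, "}".
(def attempt
  (.replaceAll "<a href=\"foo.com\">baz</a>"
               "<a.*href=\"(.*)\">(.*)</a>"
               "\2\\footnote{\1}"))

(= attempt (str (char 2) "footnote{" (char 1) "}"))
;; => true
```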

Upvotes: 4
