Reputation: 933
I'm trying to convert from HTML to LaTeX, and want to change this:
<a href="www.foo.com/bar">baz</a>
into:
baz\footnote{www.foo.com/bar}
I'd like to write a Clojure function that takes a chunk of text and replaces every match found in a given paragraph.
I've tried
(.replaceAll
"<a href=\"foo.com\">baz</a>"
"<a.*href=\"(.*)\">(.*)</a>"
"\2\\footnote{\1}")
but that returns:
"^Bfootnote{^A}"
I've also looked at clojure.contrib.str-utils2, which has a replace function that uses regular expressions, but it doesn't seem to handle backreferences. Am I missing something? Going about this the wrong way? Any help is appreciated.
Upvotes: 2
Views: 1582
Reputation: 17299
And if you want to be really spiffy, you can go for clojure.xml. It returns a tree of structures that you can modify as you like. Your example above would look like this:
{:tag :a :attrs {:href "www.foo.com/bar"} :content ["baz"]}
This can easily be translated to something like:
["baz" {:footnote "www.foo.com/bar"}]
which can easily be serialised back to your desired form. And the best part: no unmaintainable regexes. :) YMMV, of course.
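A minimal sketch of this tree-based approach, assuming the input is well-formed XML (real-world HTML often is not, and would need an HTML-tolerant parser first). The helper names parse-fragment and node->latex are made up for illustration; only clojure.xml/parse is an actual library call.

```clojure
(require '[clojure.xml :as xml])

(defn parse-fragment
  "Parses a well-formed XML string into clojure.xml's element maps."
  [^String s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn node->latex
  "Walks the parsed tree and renders it as text. Anchors become
  label\\footnote{href}; every other element contributes only its content."
  [node]
  (let [children (apply str (map node->latex (:content node)))]
    (cond
      (string? node)     node
      (= :a (:tag node)) (str children
                              "\\footnote{" (get-in node [:attrs :href]) "}")
      :else              children)))

;; Prints: baz\footnote{www.foo.com/bar}
(println (node->latex (parse-fragment "<a href=\"www.foo.com/bar\">baz</a>")))
```

Because the anchor is handled as a tree node, nested markup and several links in one paragraph need no extra pattern work.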
Upvotes: 1
Reputation: 72926
(You should not parse HTML with a regex...)
Two things:
Java uses $1, $2 to refer to capture groups in the replacement string, not \1, \2.
You need more backslashes in the replacement text. The first level of backslashing is consumed by the Clojure reader because it's a literal string. The second level is consumed by replaceAll itself, which treats a backslash in the replacement string as an escape character. Unfortunately Clojure doesn't have a general syntax for "raw" String literals (yet?). The Clojure literal regex syntax #"" does some magic to save you some backslashes, but normal Strings don't have that magic.
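The two escaping layers can be seen in isolation. This is only an illustrative sketch: the names two-backslashes, ctrl-b, replacement and link->footnote are invented here, while Matcher/quoteReplacement is the standard Java helper that adds the second layer of escaping so you only have to write the reader-level one.

```clojure
(import 'java.util.regex.Matcher)

;; Layer 1: the Clojure reader. Each \\ in the source becomes one backslash,
;; so this literal holds exactly two backslash characters.
(def two-backslashes "\\\\")

;; The reader also accepts octal escapes, so this is a one-character string
;; holding ASCII 2, not a backreference to group 2.
(def ctrl-b "\2")

;; Layer 2: the replacement syntax of String.replaceAll, where a backslash
;; escapes the next character and $n refers to capture group n.
;; quoteReplacement escapes the literal part for that layer.
(def replacement
  (str "$2" (Matcher/quoteReplacement "\\footnote{") "$1}"))

(defn link->footnote
  "Rewrites a single anchor using the same pattern as in the question."
  [^String html]
  (.replaceAll html "<a.*href=\"(.*)\">(.*)</a>" replacement))

;; Prints: baz\footnote{www.foo.com/bar}
(println (link->footnote "<a href=\"www.foo.com/bar\">baz</a>"))
```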
So:
user> (.replaceAll "<a href=\"www.foo.com/bar\">baz</a>"
"<a.*href=\"(.*)\">(.*)</a>"
"$2\\\\footnote{$1}")
"baz\\footnote{www.foo.com/bar}"
You can also do it this way:
user> (require '(clojure.contrib [str-utils2 :as s]))
nil
user> (s/replace "<a href=\"www.foo.com/bar\">baz</a>"
#"<a.*href=\"(.*)\">(.*)</a>"
(fn [[_ url txt]]
(str txt "\\\\footnote{" url "}")))
"baz\\footnote{www.foo.com/bar}"
"\2"
is a control character (ASCII character 2), which is why it's displayed as ^B. It is nearly the same as doing (char 2).
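Since the question asks for every link in a paragraph, note that the greedy .* in the pattern above would swallow several links as a single match. A hedged sketch of a reusable function follows, assuming clojure.string (bundled with Clojure since 1.2) instead of clojure.contrib; link-re and links->footnotes are illustrative names. Unlike str-utils2, clojure.string/replace treats the return value of a replacement function literally, so only the reader-level backslash escaping is needed here.

```clojure
(require '[clojure.string :as str])

;; Negated character classes and a reluctant quantifier keep each match
;; inside a single <a ...>...</a> element.
(def link-re #"<a\s[^>]*?href=\"([^\"]*)\"[^>]*>(.*?)</a>")

(defn links->footnotes
  "Replaces every anchor in text with label\\footnote{url}."
  [text]
  (str/replace text link-re
               (fn [[_ url label]]
                 (str label "\\footnote{" url "}"))))

;; Prints: See A\footnote{a.com} and B\footnote{b.com}.
(println (links->footnotes
          "See <a href=\"a.com\">A</a> and <a href=\"b.com\">B</a>."))
```

This is still a regex over markup, so it will not cope with nested tags, single-quoted attributes, or anchors that span several lines.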
Upvotes: 4