Jonathan

Reputation: 1367

Reference implementation / lib for full URL encoding

I'm writing a Java application that parses links from HTML and uses them to request their content. URL encoding is very thorny when we have no idea of the URL author's "intent". For example, whether to encode a space as %20 or + is a complex issue in its own right (%20 vs +), and a browser would perform this encoding itself for a URL containing an unencoded space.

There are many other situations in which a browser would change the content of a parsed url before requesting a page, for example:

http://www.Example.com/þ

... when parsed & requested by a browser becomes ...

http://www.Example.com/%C3%BE

... and ...

http://www.Example.com/&amp;

... when parsed & requested by a browser becomes ...

http://www.Example.com/&

So my question is: instead of reinventing the wheel, is there a Java library I haven't found that does this job? Failing that, can anyone point me towards a reference implementation in a common browser's source, or perhaps pseudocode? Failing that, any recommendations on approach are welcome!

Thanks, Jon

Upvotes: 0

Views: 180

Answers (2)

Tom Anderson

Reputation: 47223

HtmlUnit can certainly pick URLs out of HTML and resolve them (and much more).

I don't know whether it handles your corner cases, though. I would imagine it will handle the second, since that is a normal, if slightly funny-looking, use of HTML and a URL. I don't know what it will do with the first, in which an invalid URL is encoded in HTML.

I do know that if you find that HtmlUnit does something differently from how real browsers do it, write a JUnit test case to prove it, and file a bug report, then its maintainers will happily fix it with great alacrity.

Upvotes: 1

mamboking

Reputation: 4637

How about using java.net.URLEncoder.encode() and java.net.URLDecoder.decode()?
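For what it's worth, here is a minimal sketch of what those two calls do, so you can judge whether they fit. The class and method names (UrlEncodingSketch, encodeFormValue, decodeFormValue, escapeNonAscii) are made up for illustration. Note that URLEncoder implements HTML form encoding (application/x-www-form-urlencoded), so it turns a space into + and also escapes : and /, which means it is suited to individual query values rather than whole URLs. For comparison, the last helper uses a different technique, java.net.URI.toASCIIString(), which only escapes non-ASCII characters as UTF-8 and reproduces the þ example from the question. Neither is a full emulation of browser behaviour.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
class UrlEncodingSketch {

    // Form-encodes a single value: space becomes '+', reserved characters become %XX.
    static String encodeFormValue(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    // Reverses form encoding: '+' becomes a space, %XX sequences are decoded as UTF-8.
    static String decodeFormValue(String value) throws UnsupportedEncodingException {
        return URLDecoder.decode(value, "UTF-8");
    }

    // Different technique: parse as a URI and percent-encode only the non-ASCII characters.
    // Throws URISyntaxException for input that is not a legal URI (e.g. a raw space).
    static String escapeNonAscii(String url) throws URISyntaxException {
        return new URI(url).toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        String url = "http://www.Example.com/\u00FE"; // \u00FE is the character þ

        System.out.println(encodeFormValue("a b"));     // a+b
        System.out.println(encodeFormValue(url));       // http%3A%2F%2Fwww.Example.com%2F%C3%BE
        System.out.println(decodeFormValue("a+b%20c")); // a b c
        System.out.println(escapeNonAscii(url));        // http://www.Example.com/%C3%BE
    }
}
```

So encoding a whole URL with URLEncoder mangles the scheme and path separators; if you go this route you would have to split the URL into its parts first and encode only the pieces that need it.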

Upvotes: 0
