memo1288

Reputation: 738

Which characters can never appear in a URL?

I am storing a large number of URLs (around 100,000) in an XML file (along with some other data). It worked fine with fewer URLs, but now the XML file has become very big (because of tags and indentation) and slow to parse. So I thought about grouping all the URLs inside a single XML element, and for that I need a delimiter. As an example, I would like to go from this:

<document>
  <bigGroupOfURLs>
    <OneURL>
      <nameOfData1>data1_1</nameOfData1>
      <nameOfData2>data1_2</nameOfData2>
      <URL>www.site1.com</URL>
    </OneURL>
    <OneURL>
      <nameOfData1>data2_1</nameOfData1>
      <nameOfData2>data2_2</nameOfData2>
      <URL>www.site2.com</URL>
    </OneURL>
  </bigGroupOfURLs>
  <someOtherData>...</someOtherData>
</document>

To something like this (but not using #):

<document>
  <bigGroupOfURLs>
    data1_1#data1_2#www.site1.com#data2_1#data2_2#www.site2.com
  </bigGroupOfURLs>
  <someOtherData>...</someOtherData>
</document>

These URLs will come from tags inside HTML files, so they can come with all sorts of non-standard characters. For instance, the following are examples which may be included:

<a href="http://ja.wikipedia.org/wiki/メインページ">メインページ</a>
<a href="http://en.wikipedia.org/wiki/Stack Overflow">Stack Overflow</a>

There, we can see non-ASCII (UTF-8) characters and a space. These URLs are correctly interpreted, and I want to store them as they appear there. So, which character is guaranteed to never appear in a URL? I would prefer it to be a printable character. Notice that this will be inside an XML file, so I probably should not use the characters </>.

Upvotes: 4

Views: 2028

Answers (2)

Michael Kay

Reputation: 163595

There is more than one definition of "URL". Very often the term is used where "URI" or "IRI" is more correct. Many systems try to be permissive and allow things that are not technically legal according to the specs; Postel's law applies here, with its inevitable consequence that if some systems start being liberal about what they accept, everyone else has to follow suit.

A pretty safe delimiter to use is a single space, especially if you take care to ensure that any spaces within a URL are properly %-encoded as %20.
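
As a minimal sketch of that idea, assuming Python and its standard `xml.etree.ElementTree` module (the helper name `pack` and the sample items are made up for illustration), the items can be joined with a single space after replacing literal spaces with %20, written into one element, and split again after parsing:

```python
import xml.etree.ElementTree as ET

def pack(items):
    # Hypothetical helper: encode literal spaces so that the delimiter
    # (a single space) can never occur inside an item.
    return " ".join(item.replace(" ", "%20") for item in items)

items = ["data1_1", "data1_2", "http://en.wikipedia.org/wiki/Stack Overflow"]

# Store every item in a single element, as proposed in the question.
group = ET.Element("bigGroupOfURLs")
group.text = pack(items)
xml_text = ET.tostring(group, encoding="unicode")

# Reading back: parse the element and split on the delimiter.
restored = ET.fromstring(xml_text).text.split(" ")
print(restored)
```

The restored URL keeps its %20 rather than the original space, which is the form a browser would send anyway. Characters that are special in XML, such as &, are escaped by the serializer, so they need no extra handling here.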

But before going for a micro-syntax like this, I would want to be quite convinced that XML parsing time really is the bottleneck.
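
One rough way to check this, assuming Python's standard `xml.etree.ElementTree` parser and a synthetic document shaped like the one in the question (the helpers `build_document` and `time_parse` are hypothetical, and the real file and parser may behave differently), is to time the parse on its own:

```python
import time
import xml.etree.ElementTree as ET

def build_document(n):
    # Synthetic stand-in for the real file: n records in the
    # element-per-field layout shown in the question.
    records = "".join(
        "<OneURL><nameOfData1>data%d_1</nameOfData1>"
        "<nameOfData2>data%d_2</nameOfData2>"
        "<URL>www.site%d.com</URL></OneURL>" % (i, i, i)
        for i in range(n)
    )
    return "<document><bigGroupOfURLs>%s</bigGroupOfURLs></document>" % records

def time_parse(xml_text):
    # Measure only the parsing step, not file I/O or later processing.
    start = time.perf_counter()
    root = ET.fromstring(xml_text)
    elapsed = time.perf_counter() - start
    return root, elapsed

root, elapsed = time_parse(build_document(100_000))
url_count = len(root.findall("./bigGroupOfURLs/OneURL/URL"))
print(f"parsed {url_count} URLs in {elapsed:.3f} s")
```

If the measured time is small compared with the total running time, the slowdown is probably somewhere else, for example in how the parsed tree is processed afterwards.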

Upvotes: 3

nwellnhof

Reputation: 33658

Both of the URLs you mentioned are actually invalid:

http://ja.wikipedia.org/wiki/メインページ
http://en.wikipedia.org/wiki/Stack Overflow

If you type them in your browser, they will be percent-encoded before they're sent to the server. According to RFC 3986, the space character and the following printable ASCII characters are invalid in a URL:

" < > \ ^ ` { | }

Non-ASCII characters (multi-byte UTF-8 sequences) are invalid as well unless they are percent-encoded. That said, it's possible that some servers still accept these characters.
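
To illustrate the rule, here is a small Python sketch (the function name `invalid_chars` is made up, and control characters are deliberately left out for brevity) that reports which characters in a string fall into the categories listed above:

```python
# Space plus the printable ASCII characters listed above.
FORBIDDEN_ASCII = set(' "<>\\^`{|}')

def invalid_chars(url):
    # Hypothetical helper: collect characters that are either in the
    # forbidden ASCII set or outside the ASCII range altogether.
    return sorted({c for c in url if c in FORBIDDEN_ASCII or ord(c) > 127})

print(invalid_chars("http://en.wikipedia.org/wiki/Stack Overflow"))
print(invalid_chars("http://ja.wikipedia.org/wiki/メインページ"))
print(invalid_chars("http://example.com/a?b=c#d"))
```

An empty result means the string contains none of these characters. It does not prove that the URL is otherwise well-formed.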

So I'd suggest that you normalize your URLs and separate them with whitespace.
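
A possible sketch of that normalization in Python, assuming the standard `urllib.parse.quote` function is an acceptable encoder (the helper `normalize_url` and the choice of safe characters are assumptions, and non-ASCII host names would need separate IDN handling):

```python
from urllib.parse import quote

# Reserved characters of RFC 3986 plus "%", so that delimiters and
# existing percent-escapes are left untouched.
SAFE = ":/?#[]@!$&'()*+,;=%"

def normalize_url(url):
    # Hypothetical helper: percent-encode everything that is neither
    # unreserved nor listed in SAFE; non-ASCII text is encoded as UTF-8.
    return quote(url.strip(), safe=SAFE)

urls = [
    "http://ja.wikipedia.org/wiki/メインページ",
    "http://en.wikipedia.org/wiki/Stack Overflow",
]

# After normalization no URL contains whitespace, so a space is a safe delimiter.
packed = " ".join(normalize_url(u) for u in urls)
unpacked = packed.split(" ")
print(unpacked)
```

Because "%" is treated as safe, URLs that are already percent-encoded pass through unchanged instead of being encoded twice.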

Upvotes: 2
