Which characters can never appear in a URL?

Question

I am storing a large number of URLs (around 100,000) in an XML file (along with some other data). It worked fine with fewer URLs, but now the XML file has become very large (because of tags and indentation) and slow to parse. So I thought about grouping all the URLs inside a single XML element, and for that I need a delimiter. As an example, I would like to go from this:


  
    
      data1_1
      data1_2
      www.site1.com
    
    
      data2_1
      data2_2
      www.site2.com
    
  
  ...

To something like this (but not using #):


  
    data1#data2#www.site1.com#data1#data2#www.site2.com
  
  ...

These URLs will come from tags inside HTML files, so they can come with all sorts of non-standard characters. For instance, the following are examples which may be included:

メインページ
Stack Overflow

There, we can see UTF-8 characters and a space. These URLs are correctly interpreted, and I want to store them as they appear there. So, which character is guaranteed to never appear in a URL? I would prefer it to be a printable character. Note that this will be inside an XML file, so I should probably avoid characters that are special in XML markup, such as < and >.

Michael Kay · Accepted Answer

There is more than one definition of "URL". Very often the term is used where "URI" or "IRI" is more correct. Many systems try to be permissive and allow things that are not technically legal according to the specs; Postel's law applies here, with its inevitable consequence that if some systems start being liberal about what they accept, everyone else has to follow suit.

A pretty safe delimiter to use is a single space, especially if you take care to ensure that any spaces within a URL are properly %-encoded as %20.
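A minimal Python sketch of the space-delimiter idea follows. The helper names `pack_urls` and `unpack_urls` and the example URLs are made up for illustration, and only literal spaces are escaped; tabs, newlines, and any XML escaping are assumed to be handled elsewhere (for example by the XML library that writes the file).

```python
def pack_urls(urls):
    """Join URLs into one space-delimited string.

    Spaces inside a URL are rewritten as %20 so that the only literal
    spaces left in the result are the delimiters.
    """
    return " ".join(url.strip().replace(" ", "%20") for url in urls)


def unpack_urls(packed):
    """Split the space-delimited string back into individual URLs."""
    return packed.split(" ") if packed else []


# Hypothetical input resembling the examples in the question.
urls = ["http://example.com/Stack Overflow", "http://example.org/メインページ"]
packed = pack_urls(urls)
restored = unpack_urls(packed)
```

Note that the restored URLs contain `%20` where the originals had a space, so this does not preserve the text exactly as it appeared in the HTML; whether that matters depends on how the URLs are used later.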

But before going for a micro-syntax like this, I would want to be quite convinced that XML parsing time really is the bottleneck.
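One rough way to check that is to time the parse on its own. The sketch below uses Python's standard `xml.etree.ElementTree` on synthetic data; the tag names (`root`, `item`, `a`, `b`, `url`) are placeholders, not the asker's real schema, and the numbers will differ for another parser or language.

```python
import time
import xml.etree.ElementTree as ET


def time_parse(xml_text, repeats=5):
    """Return the best wall-clock time in seconds to parse xml_text."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        ET.fromstring(xml_text)
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic stand-in for the real file: 100,000 records, hypothetical tags.
records = "".join(
    f"<item><a>data{i}_1</a><b>data{i}_2</b><url>www.site{i}.com</url></item>"
    for i in range(100_000)
)
sample = f"<root>{records}</root>"
print(f"parse time: {time_parse(sample):.3f} s")
```

If the measured time is small compared with the total run time, the slowdown is probably somewhere else (file I/O, building objects from the parsed tree, or later processing), and changing the format would not help much.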
