asit_dhal

Reputation: 1269

Proper usage of URL encoding

I am writing an HTTP client that needs to send HTTP GET requests to fetch data. I am using the Boost.Asio library, so I have no standard URL-encoding library to rely on.

Here is what I got from netcat and Mozilla (a typical GET request). The URL I entered:

localhost:2000/questions/10838702/how-to-encode or-d   ecode-url-in-objective-c

GET request received by netcat:

F:\pydev>nc -l -p 2000
GET /questions/10838702/how-to-encode%20or-d%20%20%20ecode-url-in-objective-c HTTP/1.1
Host: localhost:2000
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

I found that Mozilla only encodes the part of the URL after the host (here, the spaces in the path).

I also tried this URL-encoding web page: http://meyerweb.com/eric/tools/dencoder/

It encodes the following URL

localhost:2000/questions/10838702/how-to-encode or-d   ecode-url-in-objective-c

to

localhost%3A2000%2Fquestions%2F10838702%2Fhow-to-encode%20or-d%20%20%20ecode-url-in-objective-c

Can anyone suggest where I should apply URL encoding?

Upvotes: 1

Views: 240

Answers (1)

slashingweapon

Reputation: 11317

As a general rule, any character other than alphanumerics (A-Z, a-z, 0-9), - _ . and ~ either has some special purpose in a URL or is not allowed.

Reserved characters are ; / ? : @ & and =. The space character is not reserved, but it is unsafe and must always be encoded. If you use a reserved character for anything other than its special meaning, you must URL-encode it. To be safe, many encoders simply encode everything that isn't explicitly safe.

For example, say you have a file named file?name and you need to create a URL for it. The problem is that http://somehost.com/file?name will not be interpreted the way you want. The URL will match /file in your web space, with a query string of name. You have to encode the file name to get the URL http://somehost.com/file%3Fname.
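Since you can't pull in an encoding library, here is a minimal sketch of the "encode everything that isn't explicitly safe" approach in C++. The function name percent_encode is my own invention, not something from Boost or the standard library, and it assumes the input is a raw, not yet encoded, byte string.

```cpp
#include <string>

// Hypothetical helper (not part of Boost or the standard library).
// Percent-encodes every byte that is not a letter, a digit,
// or one of '-', '_', '.', '~'.
std::string percent_encode(const std::string& input) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(input.size());
    for (unsigned char c : input) {
        const bool safe =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (safe) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];    // high nibble
            out += hex[c & 0x0F];  // low nibble
        }
    }
    return out;
}
```

Calling percent_encode("file?name") gives file%3Fname, which is the encoded file name used in the example URL above.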

The spec allows you to URL-encode any character, even alphanumerics, with the expectation that the server will decode them. You just have to make sure that wherever reserved characters are used for their intended purpose, they are not encoded. E.g., you don't want to encode the colon or slashes in http://somehost.com because they are being used as delimiters.
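For the request in the question, that means encoding each path segment on its own and leaving the slashes alone. A rough sketch follows; encode_path and percent_encode are hypothetical names of mine, and the code assumes the path has not been encoded already (a literal % in the input would become %25).

```cpp
#include <string>

// Hypothetical helper: percent-encodes every byte that is not
// a letter, a digit, or one of '-', '_', '.', '~'.
std::string percent_encode(const std::string& input) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : input) {
        const bool safe =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (safe) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: encodes each segment of a raw path separately,
// so the '/' delimiters between segments are never encoded.
std::string encode_path(const std::string& path) {
    std::string out;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type slash = path.find('/', start);
        if (slash == std::string::npos) {
            out += percent_encode(path.substr(start));  // last segment
            break;
        }
        out += percent_encode(path.substr(start, slash - start));
        out += '/';
        start = slash + 1;
    }
    return out;
}
```

Passing the path from the question through encode_path reproduces the request line Firefox sent: the spaces become %20 and the slashes stay as they are. The host and port go into the Host header unencoded.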

The most frequent use of URL encoding is to prepare form data. In this case you usually start with a set of key-value pairs. You would construct the encoded data for a form like so (in pseudocode):

  1. Encode the key and value
  2. Concatenate the key and value with '=' between them to get a term, e.g. encodedKey=encodedValue.
  3. Repeat 1 and 2 until you have a list of terms
  4. Join all the terms with ampersands, e.g. encKey1=encVal1&encKey2=encVal2
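
The steps above could look roughly like this in C++. encode_form and percent_encode are hypothetical names, not library functions. One assumption to be aware of: this sketch writes a space as %20, whereas browsers submitting HTML forms conventionally write it as +.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: percent-encodes every byte that is not
// a letter, a digit, or one of '-', '_', '.', '~'.
std::string percent_encode(const std::string& input) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : input) {
        const bool safe =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (safe) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: builds "encKey1=encVal1&encKey2=encVal2".
// Keys and values are encoded first; '=' and '&' are added afterwards
// so the delimiters themselves are never encoded.
std::string encode_form(
    const std::vector<std::pair<std::string, std::string>>& fields) {
    std::string out;
    for (const auto& field : fields) {
        if (!out.empty()) {
            out += '&';  // step 4: join terms with ampersands
        }
        out += percent_encode(field.first);   // step 1: encode the key
        out += '=';                           // step 2: key=value
        out += percent_encode(field.second);  // step 1: encode the value
    }
    return out;
}
```

A value that itself contains & or = (say a&b=c) comes out as a%26b%3Dc, so it cannot be mistaken for a delimiter when the data is split again.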

Decoding is the reverse process:

  1. Split the form data along the '&' signs to get an array of terms
  2. Split each term on the '=' character to get the encoded key and value
  3. Decode the key and value
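
A sketch of the decoding side, again with made-up names (hex_value, percent_decode, decode_form). The order matters: split first, decode last, otherwise an encoded & or = inside a value would be treated as a delimiter. I am assuming malformed escapes (a % not followed by two hex digits) are simply copied through unchanged, and that + is not treated as a space.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: value of a hex digit, or -1 if c is not one.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: turns each %XX escape back into one byte.
std::string percent_decode(const std::string& input) {
    std::string out;
    for (std::string::size_type i = 0; i < input.size(); ++i) {
        if (input[i] == '%' && i + 2 < input.size()) {
            const int hi = hex_value(input[i + 1]);
            const int lo = hex_value(input[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;  // skip the two hex digits
                continue;
            }
        }
        out += input[i];  // ordinary character or malformed escape
    }
    return out;
}

// Hypothetical helper: splits on '&', then on the first '=',
// and only then decodes the key and the value.
std::vector<std::pair<std::string, std::string>> decode_form(
    const std::string& data) {
    std::vector<std::pair<std::string, std::string>> fields;
    std::string::size_type start = 0;
    while (start <= data.size()) {
        std::string::size_type amp = data.find('&', start);
        if (amp == std::string::npos) amp = data.size();
        const std::string term = data.substr(start, amp - start);
        if (!term.empty()) {
            const std::string::size_type eq = term.find('=');
            const std::string key = term.substr(0, eq);
            const std::string value =
                (eq == std::string::npos) ? std::string() : term.substr(eq + 1);
            fields.emplace_back(percent_decode(key), percent_decode(value));
        }
        start = amp + 1;
    }
    return fields;
}
```

Empty terms (for example from a trailing &) are skipped, and a term without an = is treated as a key with an empty value; both are choices I made for the sketch, not something the spec dictates.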

It sounds simple, but you might be shocked at how many people get it wrong.

I have glossed over some of the finer details here. As always, the relevant specification is the last word: in this case RFC 1738, as updated by RFC 3986.

Upvotes: 2
