Reputation: 529

How can I replace the URL-encoded sequences from a string in a C program?

I have a string that looks like:

http%3A%2F%2Fmanifest.googlevideo.com%2Fapi%2Fmanifest%2Fdash%2Fms%2Fau%2Fmt%2F1466992558%2Fmv%2Fm%2Fsver%2F3%2Fitag%2F0%2Fsignature%2F9811214A6751583E8AAD1951B992D8E011C91E5C.5DBFF1BD54C73C370B058B8BE27CB5848CEAF391%2Fkey%2Fyt6%2Fmn%2Fsn-ipoxu-un56%2Fas%2Ffmp4

There are some URL-encoded characters within it and I want them replaced by the characters they represent.

Does anyone know if there is a convenient way to do this?

Upvotes: 1

Answers (4)

flodis

Reputation: 1221

This is a popular and compact conversion from a number to a hex character. It may inspire other number-to-character translations.

char hex_digit(char c)
{
    return "0123456789ABCDEF"[c & 0x0F];
}

Upvotes: 0

Myst

Reputation: 19221

Read Jonathan Leffler's comprehensive answer, as the information in this answer and the code presented here is reviewed there with more information and benchmarks.

Note that the code here has since been revised; the revised version is covered towards the end of Jonathan Leffler's answer.

I'm editing the answer to incorporate Jonathan Leffler's concept of unifying the error check for escaped sequences (isxdigit) with the conversion from hex to char.

The macro below makes the enclosing decoding function return -1 when a hex digit is invalid; an inline function can't return from its caller, so it can't do this in a single step.

#define hex_val(c)                                                        \
  (((c) >= '0' && (c) <= '9') ? ((c)-48) : (((c) >= 'a' && (c) <= 'f') || \
                                            ((c) >= 'A' && (c) <= 'F'))   \
                                               ? (((c) | 32) - 87)        \
                                               : ({                       \
                                                   return -1;             \
                                                   0;                     \
                                                 }))

static ssize_t decode_url(char* dest, const char* url_data, size_t length) {
  char* pos = dest;
  const char* end = url_data + length;
  while (url_data < end) {
    if (*url_data == '+') {
      // decode space
      *(pos++) = ' ';
      ++url_data;
    } else if (*url_data == '%') {
      // decode hex value
      // a percent encoded value needs two more bytes inside the buffer
      if (end - url_data < 3)
        return -1;
      *(pos++) = (hex_val(url_data[1]) << 4) | hex_val(url_data[2]);
      url_data += 3;
    } else
      *(pos++) = *(url_data++);
  }
  *pos = 0;
  return pos - dest;
}

#undef hex_val

Note that if you're always going to use a NUL-terminated string, there is no need to call strlen to calculate the length, since we can simply decode until we meet the NUL character.

i.e.:

static ssize_t decode_url_unsafe(char* dest, const char* url_data) {
  char* pos = dest;
  while (*url_data) {
     // [... same code as above ...]
  }
  *pos = 0;
  return pos - dest;
}

pre-edit

Here's another solution taken from this library I'm working on.

It might not be as short / elegant, but it avoids function calls (sscanf) and allows us to decode strings that aren't NUL-terminated, which implicitly offers some protection from overflow.

#define is_hex(c)                                              \
  (((c) >= '0' && (c) <= '9') || ((c) >= 'a' && (c) <= 'f') || \
   ((c) >= 'A' && (c) <= 'F'))
#define hex_val(c) (((c) >= '0' && (c) <= '9') ? ((c)-48) : (((c) | 32) - 87))

static ssize_t decode_url(char* dest, const char* url_data, size_t length) {
  char* pos = dest;
  for (size_t i = 0; i < length; i++) {
    if (url_data[i] == '+')  // decode space
      *(pos++) = ' ';
    else if (url_data[i] == '%') {
      // decode hex value
      // both hex digits must lie inside the buffer
      if (i + 2 < length && is_hex(url_data[i + 1]) &&
          is_hex(url_data[i + 2])) {
        // this is a percent encoded value.
        *(pos++) = (hex_val(url_data[i + 1]) << 4) | hex_val(url_data[i + 2]);
        i += 2;
      } else {
        // there was an error in the URL encoding...
        return -1;
      }
    } else
      *(pos++) = url_data[i];
  }
  *pos = 0;
  return pos - dest;
}

Upvotes: 1

Jonathan Leffler

Reputation: 754060

Here's the code I assembled from the answer by Myst and also the answer by chema989, plus a couple of variations of my own, plus the timing test harness, plus some results.

The functions are unimaginatively renamed to decode_1() through decode_5(). Variant 1 is direct from the answer by chema989, using an inline function ishex(). Variant 2 is the same except using the macro isxdigit() from <ctype.h>. The performance difference is not significant, and this is all the more relevant when you realize that there is considerable overhead in the use of sscanf(). Variant 3 is similar to Variants 1 and 2 in some respects, but uses an (inline) function to_hex() to convert a character to the corresponding hex digit (returning -1 if it is ever passed an invalid hex digit — the test code doesn't check this). This is about 10 times as fast as either variant 1 or 2 because it avoids the overhead of sscanf().

Variant 4 is a streamlined adaptation of Variant 3. It avoids testing dec, which is code present in Variants 1-3 that only makes sense if the decode function can be passed a null pointer. However, the code cannot be passed a null pointer safely since the main loop increments it, and then dereferences the no-longer-null pointer — woefully undefined behaviour that usually leads to a crash.

Variant 5 is the code by Myst with the interface converted back to the same as the other 4 variants. The changes are basically trivial: reorder the parameters, redefine the return type (Myst's return type is arguably better, but it is inconsistent with the others), and do without the length parameter by checking whether the current element is the null byte (instead of comparing with the no-longer-available length).

The test harness uses some timing functions that provide a largely platform-independent interface to the system's timing functions. I print microsecond resolution results. The test harness carefully modifies the URL string (character offsets 5 and 6) to prevent the optimizer from over-optimizing. The total length report ensures things are consistent. It spotted a problem; variants 4 and 5 returned 21,300,000 while variants 1-3 returned 21,400,000. The - 1 in the return lines of variants 1-3 fixes the discrepancy. The length returned is the length of the destination string as would be reported by strlen(), excluding the terminating null byte.

Code

#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include "timer.h"

static inline int ishex(int x)
{
    return (x >= '0' && x <= '9')  ||
           (x >= 'a' && x <= 'f')  ||
           (x >= 'A' && x <= 'F');
}

static int to_hex(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    else if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    else if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    else
        return -1;
}

static int decode_1(const char *s, char *dec)
{
    char *o;
    const char *end = s + strlen(s);
    int c;

    for (o = dec; s <= end; ++o)
    {
        c = *s++;
        if (c == '+')
            c = ' ';
        else if (c == '%' && (!ishex(*s++)  ||
                              !ishex(*s++)    ||
                              !sscanf(s - 2, "%2x", &c)))
            return -1;
        if (dec)
            *o = c;
    }
    return o - dec - 1;
}

static int decode_2(const char *s, char *dec)
{
    char *o;
    const char *end = s + strlen(s);
    int c;

    for (o = dec; s <= end; ++o)
    {
        c = *s++;
        if (c == '+')
            c = ' ';
        else if (c == '%' && (!isxdigit(*s++)  ||
                              !isxdigit(*s++)    ||
                              sscanf(s - 2, "%2x", &c) != 1))
            return -1;
        if (dec)
            *o = c;
    }
    return o - dec - 1;
}

static int decode_3(const char *s, char *dec)
{
    char *o;
    const char *end = s + strlen(s);
    int c;

    for (o = dec; s <= end; ++o)
    {
        int c1;
        int c2 = 0;
        c = *s++;
        if (c == '+')
            c = ' ';
        else if (c == '%')
        {
            if ((c1 = to_hex(*s++)) == -1 ||
                (c2 = to_hex(*s++)) == -1)
                return -1;
            else
                c = c1 * 16 + c2;
        }
        if (dec)
            *o = c;
    }
    return o - dec - 1;
}

static int decode_4(const char *s, char *dec)
{
    char *o;
    int c;

    for (o = dec; (c = *s++) != '\0'; ++o)
    {
        if (c == '+')
            c = ' ';
        else if (c == '%')
        {
            int c1 = to_hex(*s++);
            int c2 = to_hex(*s++);
            if (c1 == -1 || c2 == -1)
                return -1;
            else
                c = c1 * 16 + c2;
        }
        *o = c;
    }
    *o = '\0';
    return o - dec;
}

#define is_hex(c)                                              \
  (((c) >= '0' && (c) <= '9') || ((c) >= 'a' && (c) <= 'f') || \
   ((c) >= 'A' && (c) <= 'F'))
#define hex_val(c) (((c) >= '0' && (c) <= '9') ? ((c)-48) : (((c) | 32) - 87))

static int decode_5(const char *url_data, char *dest)
{
    char *pos = dest;
    for (size_t i = 0; url_data[i] != '\0'; i++)
    {
        if (url_data[i] == '+')  // decode space
            *(pos++) = ' ';
        else if (url_data[i] == '%')
        {
            // decode hex value
            if (is_hex(url_data[i + 1]) && is_hex(url_data[i + 2]))
            {
                // this is a percent encoded value.
                *(pos++) = (hex_val(url_data[i + 1]) << 4) | hex_val(url_data[i + 2]);
                i += 2;
            }
            else
            {
                // there was an error in the URL encoding...
                return -1;
            }
        }
        else
            *(pos++) = url_data[i];
    }
    *pos = '\0';
    return pos - dest;
}

enum { MAX_COUNT = 100000 };

static void tester(const char *tag, char *url, int (*decoder)(const char *, char *))
{
    char out[1024];
    unsigned long totlen = 0;
    Clock c;

    clk_init(&c);
    clk_start(&c);
    for (int i = 0; i < MAX_COUNT; i++)
    {
        url[5] = (i % 8) + '1';
        totlen += (*decoder)(url, out);
    }
    clk_stop(&c);

    char buffer[32];
    printf("%8s: %s (%lu)\n", tag, clk_elapsed_us(&c, buffer, sizeof(buffer)), totlen);
}

int main(void)
{
    char url[] =
        "http%3A%2F%2Fmanifest.googlevideo.com%2Fapi%2Fmanifest"
        "%2Fdash%2Fms%2Fau%2Fmt%2F1466992558%2Fmv%2Fm%2Fsver%2F3"
        "%2Fitag%2F0%2Fsignature%2F9811214A6751583E8AAD1951B992D"
        "8E011C91E5C.5DBFF1BD54C73C370B058B8BE27CB5848CEAF391%2F"
        "key%2Fyt6%2Fmn%2Fsn-ipoxu-un56%2Fas%2Ffmp4";

    for (int i = 0; i < 10; i++)
    {
        url[6] = (i % 6) + 'A';
        tester("isxdigit", url, decode_2);
        tester("ishex",    url, decode_1);
        tester("tohex-A",  url, decode_3);
        tester("tohex-B",  url, decode_4);
        tester("hex_val",  url, decode_5);
    }

    return 0;
}

Tweaking the code to set url[6] = (i % 6) + 'A'; before each call to tester() did not make very much difference (but did remove a casual asymmetry between the 5 test regimes). If anything, tohex-B was slightly better than hex_val with the change, but it would have taken many more test runs to determine if that was significant.

Statistics

Tag        N    Average  Std Dev
isxdigit   10   0.3000   0.0126
ishex      10   0.3180   0.0184
tohex-A    10   0.0366   0.0025
tohex-B    10   0.0318   0.0049
hex_val    10   0.0322   0.0043

Clearly, the first two variants are almost 10 times as slow as the last three. The difference between the tohex-A variant (Variant 3) and the last two (Variants 4 and 5) is almost certainly significant. The difference between the last two is not significant; indeed, the relative speeds are occasionally inverted:

Tag        N    Average  Std Dev
isxdigit   10   0.3072   0.0096
ishex      10   0.3185   0.0137
tohex-A    10   0.0387   0.0044
tohex-B    10   0.0322   0.0031
hex_val    10   0.0324   0.0045

Tag        N    Average  Std Dev
isxdigit   10   0.3128   0.0162
ishex      10   0.3225   0.0186
tohex-A    10   0.0374   0.0054
tohex-B    10   0.0311   0.0048
hex_val    10   0.0310   0.0031

For the sample data where the URL encoding uses upper case hex letters, Variant 5 could perhaps be tweaked by testing for [A-F] before [a-f], but the difference won't be great.

For the record, the compiler was a home-built GCC 6.1.0, the O/S was Mac OS X 10.11.5, and the machine is an early-2011 MacBook Pro with a 2.3 GHz Intel Core i7 processor. (It has 16 GiB 1333 MHz DDR3 main memory, but memory is not a major factor in this test.)

Raw results

This is the raw timing data for the first of the three sets of statistics.

isxdigit: 0.312046 (21300000)
   ishex: 0.333541 (21300000)
 tohex-A: 0.036558 (21300000)
 tohex-B: 0.029301 (21300000)
 hex_val: 0.030238 (21300000)
isxdigit: 0.287166 (21300000)
   ishex: 0.305931 (21300000)
 tohex-A: 0.034781 (21300000)
 tohex-B: 0.028075 (21300000)
 hex_val: 0.028881 (21300000)
isxdigit: 0.280347 (21300000)
   ishex: 0.290434 (21300000)
 tohex-A: 0.037599 (21300000)
 tohex-B: 0.028111 (21300000)
 hex_val: 0.028811 (21300000)
isxdigit: 0.282539 (21300000)
   ishex: 0.297163 (21300000)
 tohex-A: 0.040645 (21300000)
 tohex-B: 0.027662 (21300000)
 hex_val: 0.030026 (21300000)
isxdigit: 0.299307 (21300000)
   ishex: 0.324027 (21300000)
 tohex-A: 0.034579 (21300000)
 tohex-B: 0.027284 (21300000)
 hex_val: 0.041233 (21300000)
isxdigit: 0.312988 (21300000)
   ishex: 0.304933 (21300000)
 tohex-A: 0.034615 (21300000)
 tohex-B: 0.033855 (21300000)
 hex_val: 0.028036 (21300000)
isxdigit: 0.305514 (21300000)
   ishex: 0.341806 (21300000)
 tohex-A: 0.034160 (21300000)
 tohex-B: 0.038685 (21300000)
 hex_val: 0.029262 (21300000)
isxdigit: 0.312998 (21300000)
   ishex: 0.314886 (21300000)
 tohex-A: 0.037663 (21300000)
 tohex-B: 0.030687 (21300000)
 hex_val: 0.036551 (21300000)
isxdigit: 0.307092 (21300000)
   ishex: 0.343648 (21300000)
 tohex-A: 0.040578 (21300000)
 tohex-B: 0.041255 (21300000)
 hex_val: 0.034707 (21300000)
isxdigit: 0.300396 (21300000)
   ishex: 0.323866 (21300000)
 tohex-A: 0.034540 (21300000)
 tohex-B: 0.032675 (21300000)
 hex_val: 0.034564 (21300000)

Myst's Revised Code

Using the revised code from Myst's updated answer as hex_val2 (and retagging the original as hex_val1), the statistics from three runs were:

Tag        N    Average  Std Dev
isxdigit   10   0.3109   0.0127
ishex      10   0.3079   0.0242
tohex-A    10   0.0384   0.0051
tohex-B    10   0.0309   0.0039
hex_val1   10   0.0327   0.0042
hex_val2   10   0.0263   0.0039

Tag        N    Average  Std Dev
isxdigit   10   0.3003   0.0132
ishex      10   0.3079   0.0150
tohex-A    10   0.0398   0.0070
tohex-B    10   0.0311   0.0035
hex_val1   10   0.0310   0.0032
hex_val2   10   0.0285   0.0034

Tag        N    Average  Std Dev
isxdigit   10   0.3055   0.0115
ishex      10   0.3088   0.0155
tohex-A    10   0.0358   0.0030
tohex-B    10   0.0319   0.0045
hex_val1   10   0.0318   0.0031
hex_val2   10   0.0264   0.0030

That looks to be measurably faster.

Additional code

#undef hex_val

#define hex_val(c)                                                        \
  (((c) >= '0' && (c) <= '9') ? ((c)-48) : (((c) >= 'a' && (c) <= 'f') || \
                                            ((c) >= 'A' && (c) <= 'F'))   \
                                         ? (((c) | 32) - 87)              \
                                         : ({                             \
                                              return -1;                  \
                                              0;                          \
                                           }))

static int decode_6(const char *url_data, char *dest)
{
    char *pos = dest;
    while (*url_data != '\0')
    {
        if (*url_data == '+')
        {
            // decode space
            *(pos++) = ' ';
            ++url_data;
        }
        else if (*url_data == '%')
        {
            // decode hex value
            // this is a percent encoded value.
            *(pos++) = (hex_val(url_data[1]) << 4) | hex_val(url_data[2]);
            url_data += 3;
        }
        else
            *(pos++) = *(url_data++);
    }
    *pos = '\0';
    return pos - dest;
}

#undef hex_val

Upvotes: 1

chema989

Reputation: 4172

I think you need URL decoding. Here is a URL decoder based on: https://www.rosettacode.org/wiki/URL_decoding#C

#include <stdio.h>
#include <string.h>
#include <ctype.h>

int decode(const char *s, char *dec) {
    const char *end = s + strlen(s);
    int c;

    if (dec) {
        char *o = dec;
        for (; s <= end; ++o) {
            c = *s++;
            if (c == '+') c = ' ';
            else if (c == '%' && (!isxdigit(*s++)  ||
                    !isxdigit(*s++)    ||
                    !sscanf(s - 2, "%2x", &c)))
                return -1;     
            *o = c;
        }
        return o - dec;
    } else {
        int dec_len = 0;
        for (; s <= end; ++dec_len) {
            c = *s++;
            if (c == '+') c = ' ';
            else if (c == '%' && (!isxdigit(*s++)  ||
                    !isxdigit(*s++)    ||
                    !sscanf(s - 2, "%2x", &c)))
                return -1;
        }
        return dec_len;
    }
}

int main() {
    const char *url = "http%3A%2F%2Fmanifest.googlevideo.com%2Fapi%2Fmanifest%2Fdash%2Fms%2Fau%2Fmt%2F1466992558%2Fmv%2Fm%2Fsver%2F3%2Fitag%2F0%2Fsignature%2F9811214A6751583E8AAD1951B992D8E011C91E5C.5DBFF1BD54C73C370B058B8BE27CB5848CEAF391%2Fkey%2Fyt6%2Fmn%2Fsn-ipoxu-un56%2Fas%2Ffmp4";
    char out[strlen(url) + 1];

    printf("length: %d\n", decode(url, NULL));
    puts(decode(url, out) == -1 ? "Bad string" : out);

    return 0;
}

Output:

length: 214
http://manifest.googlevideo.com/api/manifest/dash/ms/au/mt/1466992558/mv/m/sver/3/itag/0/signature/9811214A6751583E8AAD1951B992D8E011C91E5C.5DBFF1BD54C73C370B058B8BE27CB5848CEAF391/key/yt6/mn/sn-ipoxu-un56/as/fmp4

Upvotes: 1
