aelgoa
aelgoa

Reputation: 1201

c string construction

I'm trying to parse argv back into a single string to use for a system call. The string shows up fine through the printf call (even without the \0 terminator), but using it as a parameter to system creates all sorts of undefined behaviour.

How can I ensure the string is properly terminated? Is there a better and more reliable way to go about parsing char[][] into char[]?

#include <stdio.h>
int main(int argc,char *argv[]){

char cmd[255]="tcc\\tcc.exe ";
char**ptr=argv+1;
while(*ptr){
   strcat(cmd,*ptr);
   ++ptr;
   if(*ptr)cmd[strlen(cmd)]=' ';
}


printf("cmd: ***%s***\n",cmd);
system(cmd);
}

I just discovered and corrected another flaw in this code, a system commands needs (escaped) backslashes for file paths

Upvotes: 0

Views: 867

Answers (2)

LSerni
LSerni

Reputation: 57453

This instruction:

   if(*ptr)
       cmd[strlen(cmd)]=' ';
   else
       cmd[strlen(cmd)]='\0';

will break cmd, because it will overwrite its zero termination. Try instead:

l = strlen(cmd);
if (*ptr) {
    cmd[l++] = ' ';
}
cmd[l] = 0x0;

This will append a space, and zero terminate the string. Actually, since it is already zero terminated, you could do better:

if (*ptr) {
    int l = strlen(cmd);
    cmd[l++] = ' ';
    cmd[l  ] = 0x0;
}

Update

A better alternative could be this:

int main(int argc, char *argv[])
{
        char cmd[255]="tcc/tcc.exe";
        char **ptr=argv+1;
        for (ptr = argv+1; *ptr; ptr++)
        {
            strncat(cmd, " ",  sizeof(cmd)-strlen(cmd));
            strncat(cmd, *ptr, sizeof(cmd)-strlen(cmd));
        }
        printf("String: '%s'.\n", cmd);
        return 0;
}

We use strncat() to check that we're not overrunning the cmd buffer, and the space gets applied in advance. This way there's no extra space at the end of the string.

It is true that strncat() is a mite slower than directly assigning cmd[], but factoring the safety and debugging time, I think it's worthwhile.

Update 2

OK, so let's try to do this fast. We keep track of what cmd's length ought to be in a variable, and copy the string with memcpy() which is slightly faster than strcpy() and does neither check string length, nor copy the extra zero at end of string.

(This saves something - remember that strcat() has to implicitly calculate the strlen of both its arguments. Here we save that).

int main(int argc, char *argv[])
{
        #define MAXCMD  255
        char    cmd[MAXCMD]="tcc/tcc.exe";
        int     cmdlen  = strlen(cmd);
        char    **ptr=argv+1;
        for (ptr = argv+1; *ptr; ptr++)
        {
            /* How many bytes do we have to copy? */
            int l = strlen(*ptr);
            /* STILL, this check HAS to be done, or the program is going to crash */
            if (cmdlen + 1 + l + 1 < MAXCMD)
            {
                /* No danger of crashing */
                cmd[cmdlen++]   = ' ';
                memcpy(cmd + cmdlen, *ptr, l);
                cmdlen  += l;
            }
            else
            {
                printf("Buffer too small!\n");
            }
        }
        cmd[cmdlen] = 0x0;
        printf("String: '%s'.\n", cmd);
        return 0;
}

Update 3 - not really recommended, but fun

It is possible to try and be smarter than the compiler's usually built-in strlen and memcpy instructions (file under: "Bad ideas"), and do without strlen() altogether. This translates into a smaller inner loop, and when strlen and memcpy are implemented with library calls, much faster performances (look ma, no stack frames!).

int main(int argc, char *argv[])
{
        #define MAXCMD  254
        char    cmd[MAXCMD+1]="tcc/tcc.exe";
        int     cmdlen  = 11; // We know initial length of "tcc/tcc.exe"!
        char    **ptr;
        for (ptr = argv+1; *ptr; ptr++)
        {
            cmd[cmdlen++] = ' ';
            while(**ptr) {
                cmd[cmdlen++] = *(*ptr)++;
                if (MAXCMD == cmdlen)
                {
                        fprintf(stderr, "BUFFER OVERFLOW!\n");
                        return -1;
                }
            }
        }
        cmd[cmdlen] = 0x0;
        printf("String: '%s'.\n", cmd);
        return 0;
}

Discussion - not so fun

Shamelessly cribbed from many a lecture I received from professors I thought shortsighted, until they were proved right each and every time.

The problem here is to exactly circumscribe what are we doing - what's the forest this particular tree is part of.

We're building a command line that will be fed to a exec() call, which means that the OS will have to build another process environment and allocate and track resources. Let's step a bit backwards: an operation will be run that will take about one millisecond, and we're feeding it a loop that might take ten microseconds instead of twenty.

The 20:10 (that's 50%!) improvement we have on the inner loop translates in a 1020:1010 (that's about 1%) just the overall process startup operation. Let's imagine the process takes half a second - five hundred milliseconds - to complete, and we're looking at 500020:500010 or 0.002% improvement, in accord with the never-sufficiently-remembered http://en.wikipedia.org/wiki/Amdahl%27s_law .

Or let's put it another way. One year hence, we will have run this program, say, one billion times. Those 10 microseconds saved now translate to a whopping 10.000 seconds, or around two hours and three quarters. We're starting to talk big, except that to obtain this result we've expended sixteen hours coding, checking and debugging :-)

The double-strncat() solution (which is actually the slowest) supplies code that is easier to read and understand, and modify. And reuse. The fastest solution, above, implicitly relies on the separator being one character, and this fact is not immediately apparent. Which means that reusing the fastest solution with ", " as separator (let's say we need this for CSV or SQL) will now introduce a subtle bug.

When designing an algorithm or piece of code, it is wise to factor not only tightness of code and local ("keyhole") performances, but also things like:

  • how that single piece affects the performances of the whole. It makes no sense to spend 10% of development time on less than 10% of the overall goal.
  • how easy it is for the compiler to interpret it (and optimize it with no further effort on our part, maybe even optimize specifically for different platforms -- all at no cost!)
  • how easy will it be for us to understand it days, weeks, or months down the line.
  • how non-specific and robust the code is, allowing to reuse it somewhere else (DRY).
  • how clear its intent is - allowing to reengineer it later, or replace with a different implementation of the same intent (DRAW).

A subtle bug

This in answer to WilliamMorris's question, so I'll use his code, but mine has the same problem (actually, mine is - not completely unintentionally - much worse).

This is the original functionality from William's code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    #define CMD "tcc/tcc.exe"
    char cmd[255] = CMD;
    char *s = cmd + sizeof CMD - 1;
    const char *end = cmd + sizeof cmd - 1;
    // Cycle syntax modified to pre-C99 - no consequences on our code
    int i;
    for (i = 1; i < argc; ++i) {
        size_t len = strlen(argv[i]);
        if (s + len >= end) {
            fprintf(stderr, "Buffer overrun!\n");
            exit(1);
        }
        // Here (will) be dragons
        //*s++ = '.';
        //*s++ = '.';
        //*s++ = '.';

        *s++ = ' ';
        memcpy(s, argv[i], len);
        s += len;
    }
    *s = '\0';
    // Get also string length, which should be at most 254
    printf("%s: string length is %d\n", cmd, (int)strlen(cmd));
    return 0;
}

The buffer overrun check verifies that the string written so far, plus the string that has yet to be written, together do not exceed the buffer. The length of the separator itself is not counted, but things will work out somehow:

        size_t len = strlen(argv[i]);
        if (s + len >= end) {
            fprintf(stderr, "Buffer overrun!\n");
            exit(1);
        }

Now we add on the separator in the most expeditious way - by repeating the poke:

        *s++ = ', ';
        *s++ = ' ';

Now if s + len is equal to end - 1, the check will pass. We now add two bytes. The total length will be s + len + 2, which is equal to end plus one:

tcc/tcc.exe, It, was, the, best, of, times, it, was, the, worst, of, times, it, was, the, age, of, wisdom, it, was, the, age, of, foolishness, it, was, the, epoch, of, belief, it, was, the, epoch, of, incredulity, it, was, the, season, of, Light, it, was: string length is 254

tcc/tcc.exe, It, was, the, best, of, times, it, was, the, worst, of, times, it, was, the, age, of, wisdom, it, was, the, age, of, foolishness, it, was, the, epoch, of, belief, it, was, the, epoch, of, incredulity, it, was, the, season, of, Light, it, ouch: string length is 255

With a longer separator, such as "... ", the problem is even more evident:

tcc/tcc.exe... It... was... the... best... of... times... it... was... the... worst... of... times... it... was... the... age... of... wisdom... it... was... the... age... of... foolishness... it... was... the... epoch... of... belief... it... was... longer: string length is 257

In my version, the fact that the check requires an exact match leads to catastrophic results, since once the buffer is overrun, the match will always fail and result in a massive memory overwrite.

If we modify my version with

if (cmdlen >= MAXCMD)

we will get a code that always intercepts buffer overruns, but still does not prevent them up to the delimiter's length minus two; i.e., a hypothetical delimiter 20 bytes long could overwrite 18 bytes past cmd's buffer before being caught.

I would point out that this is not to say that my code had a catastrophic bug (and so, once fixed, it'll live happy ever after); the point was that the code was structured in such a way that, for the sake of squeezing a little speed, a dangerous bug could easily go unnoticed, or the same bug could easily be introduced upon reuse of what looked like "safe and tested" code. This is a situation that one would be well advised to avoid.

(I'll come clean now, and confess that I myself rarely did... and too often still don't).

Upvotes: 3

William Morris
William Morris

Reputation: 3684

This might be a more complicated than you want but it avoids buffer overflows and it also exits if the buffer is too small. Note that continuing to loop once there is not enough space in the buffer for argv[N] can result in trailing strings (argv[N+1] etc) that are shorter than argv[N] being added to the string even though argv[N] was omitted...

Note that I'm using memcpy, because by that point I already know how long argv[i] is.

int main(int argc, char **argv)
{
#define CMD "tcc/tcc.exe"
    char cmd[255] = CMD;
    char *s = cmd + sizeof CMD - 1;
    const char *end = cmd + sizeof cmd - 1;

    for (int i = 1; i < argc; ++i) {
        size_t len = strlen(argv[i]);
        if (s + len >= end) {
            fprintf(stderr, "Buffer overrun!\n");
            exit(1);
        }
        *s++ = ' ';
        memcpy(s, argv[i], len);
        s += len;
    }
    *s = '\0';
    printf("%s\n", cmd);
    return 0;
}

Upvotes: 1

Related Questions