AJ-Williams1
AJ-Williams1

Reputation: 126

Why is this program allocating more memory than necessary?

I am writing a program in C that needs to read from stdin. I don't want it to allocate more memory than necessary, so I am reading the input in chunks, mallocing more memory each time a new chunk is read.

Here is the code (the allocd variable is just for tracking how much memory it has allocated):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZ 20

int main(int argc, char *argv[])
{
    char *str = malloc(1), *p = NULL;
    *str = '\0';
    char buf[SIZ];
    int bufs = 0;
    int allocd = 0;

    while (p = fgets(buf, sizeof(buf), stdin))
    {
        /* grow str */
        str = realloc(str, bufs * SIZ + SIZ);
        allocd = bufs * SIZ + SIZ;
        strcat(str, buf);
        bufs++;

        if (!p)
            break;
    }

    printf("ALLOC'D: %i", allocd);

    free(str);
}

For testing, I have a file called file.txt, which has 966 characters, as you can see when I use wc:

$ wc -m file.txt
966 file.txt

The problem is that my program seems to be allocating far more bytes of memory than there are characters in the file, as you can see:

$ ./code <file.txt
ALLOC'D: 1680

Why is this happening, and how can I fix it?

Upvotes: 1

Views: 102

Answers (2)

Luis Colorado
Luis Colorado

Reputation: 12708

I am writing a program in C that needs to read from stdin. I don't want it to allocate more memory than necessary, so I am reading the input in chunks, mallocing more memory each time a new chunk is read.

Well, your savings intentions are good idea for a programmer, but you are wrong in the savings, because you are not taking into account many things that are hidden to you, but necessary to support an efficient implementation to malloc.

  • The first is that malloc needs to associate extra memory to the block you request, in order to maintain the heap and don't get messed up with the allocation tasks. This means that, let's say that the structure it associates to each bunch of memory you request be a constant and let's say that it is 8 bytes large, the malloc(1) will need to use 8bytes + 1 (this last the one you requested) to manage your task. This means that if you make a million such allocations you will have 1 million bytes allocated in your accountability, but you will be wasting 8 millions in malloc overhead. The number of mallocs you have active counts.
  • The second is that when you malloc, you are adding to the total overhead the size of the pointer you use to remember the place malloc gave to you. This is not accounted in the last place, because you could make only one alloc to store an array, store a million structures contiguous in that array, and reference them with only a pointer. But this is frequently of no use if you are those pointers to make references between objects, you will need to include all those pointers in the accounting. If we add this overhead to the one million bytes allocated above you will be incurring in an extra overhead of 4-8million bytes more. This means you have one million bytes allocated but for maintaining those you need an extra 8 million bytes overhead for you, and 8 milliono bytes overhead hidden into malloc.
  • The initial malloc(1) in your code can be avoided. If you read the documentation of realloc(), you'll see that realloc doesn't need to have a non-null pointer to operate, if you pass a NULL pointer to it, it will behave like the initial malloc() call, but whith the real amount of storage you need.

The approach in your code is correct, you use a single active malloc all the time, you have decided to grow in steps of SIZ (a large SIZ is good to minimize the overhead of malloc calls, but you will, in the average, incurr in an overhead of not used memory ---memory allocated, but not filled with characters, of around half of the value of SIZ, probably more) As the line length is supposed to follow a poison distribution, the best value for SIZ would be the average line length (or better if you use twice that average, for better performance)

Your code, once corrected, would be:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZ 60 /* assumed an average line length of 30 chars */

int main(int argc, char *argv[])
{
    char *str = NULL; /* <<< use null, don't allocate something you don't need */
    char buf[SIZ];
    /* you don't need to know how many times you repeated the loop */
    int allocd = 0; /* allocated capacity */
    int strsz = 0;  /* filled size */

    while (fgets(buf, sizeof(buf), stdin)) /* the p pointer is not necessary */
    {
        /* grow str */
        int read_chars = strlen(buf);           /* (1 & 2) see below */
        printf("read: [%s]\n", buf);
        int pos_to_cp  = strsz;                 /* (3) we need this at the end
*/
        strsz         += read_chars;
        if (strsz >= allocd) {                  /* need to grow */
            printf("growing from %d to %d\n", allocd, allocd + (int)sizeof buf);
            allocd    += sizeof buf;            /* new size */
            str        = realloc(str, allocd);  /* reallocate to allocd */
        }
        strcpy(str + pos_to_cp, buf);           /* (3) see below */
                                                /* (4) see below */
    }

    printf("ALLOC'D: %i\n", allocd);
    printf("string: %s\n", str);

    free(str);
}

(1) The read_chars represents the size of the read string, and it will mark the point where we need to copy the string in buf.

(2) We don't use a pointer variable here, because as a result of realloc, the original pointer can change, so we must evaulate the point of copy once we have the new pointer.

(3) We use pointer arithmetic here to find the point to copy the string. This way we end copying always (at the same cost) a simple string of size sizeof buf, and not appending to a longer and longer string as we iterate on the buffer.

(4) You don't need to check for if (!p) because if p is NULL you never enter in the loop, so the check is useless.

The problem with your program was that your were assuming that the buffer was always filled, so you always needed to grow, which is not true, while fgets stops at the receiving of one \n character. So the grow of the buffer is not always needed. I have interspersed some traces in the program, so you can follow it on execution.

Upvotes: 1

tadman
tadman

Reputation: 211740

You're allocating a new chunk of memory for each line you (attempt to) read regardless of how long the line actually is. I'd argue fgets is the wrong tool here, that what you want is fread instead:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZ 20

int main(int argc, char *argv[])
{
    char *str = malloc(1);
    *str = '\0';
    char buf[SIZ];
    int allocd = 0;

    int p;

    // Note: fread() returns size_t number of records read, NOT a char*
    while ((p = fread(buf, 1, sizeof(buf), stdin)))
    {
        str = realloc(str, allocd + p + 1);

        // Concatenate the buffer
        memcpy(str + allocd, buf, p);

        allocd += p;
    }

    str[allocd + 1] = 0;

    printf("ALLOC'D: %i", allocd);

    free(str);
}

Upvotes: 3

Related Questions