Useless

Reputation: 29

How can I get a file's size in C without using either fseek or stat?

I'm doing a project for my school and I can't find out how to get the size of a file. Since I need to read a script and use it in my program, I need the size of the file to use either read or fread.

Here is what I've done to get the file size but it doesn't seem to work.

int my_size(int filedesc)
{
    int size = 1;
    int read_output = 1;
    char *buffer;

    for (size = 1; read_output != 0 ; size++) {
        buffer = malloc((size+1)*sizeof(char*));
        read_output = read(filedesc, buffer, size);
        free(buffer);
    }
    return(size);
}

The project rules forbid stat() and fseek(), and I can't call read or fread with an arbitrary fixed size such as 100, because the scripts we are given can be either small or large.

Upvotes: 1

Views: 1180

Answers (2)

John Bollinger

Reputation: 181244

If you can rely on the input to be a persistent file (i.e. residing on storage media), and on that file not being modified during your program's run, then you could pre-read it to the end to count the bytes in it, then rewind.

But outside of an academic exercise, the usual reason to forbid measuring the size via stat(), fseek(), and similar is that the input might not reside on storage media, so that

  1. you cannot determine its size without reading it, but also
  2. you cannot rewind it or seek within it.

The trick then is not how to determine the size in advance, but rather how to do without measuring the size in advance. There are at least two main strategies for that:

  • Don't rely on storing the whole contents in memory at once in the first place. Instead, operate on its contents as they are read, maintaining only enough in memory at any given time to do so.

  • Alternatively, adapt dynamically to the file size. There are many variations on this. For example, if you're just reading the file into a monolithic block then you can malloc() space and realloc() when you find you need more. Or you could store the contents in a linked list, allocating new list nodes as needed.
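
The malloc()/realloc() variation can be sketched roughly as follows. This is only an illustration under stated assumptions: the name read_all, the initial capacity of 4096 bytes, and the doubling growth factor are all hypothetical choices, not something the question or the project rules prescribe.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: reads fd until EOF into a malloc()ed buffer.
 * On success returns the buffer (the caller frees it) and stores the
 * byte count in *len_out; on failure returns NULL. */
char *read_all(int fd, size_t *len_out) {
    size_t cap = 4096;          /* initial capacity, an arbitrary choice */
    size_t len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        if (len == cap) {       /* buffer full: double its capacity */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        ssize_t n = read(fd, buf + len, cap - len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            free(buf);
            return NULL;
        }
        if (n == 0)
            break;              /* end of file */
        len += (size_t)n;       /* count only the bytes actually read */
    }
    *len_out = len;
    return buf;
}
```

Because the size is discovered while reading, this works the same way on a regular file, a pipe, or standard input, and the file is never rewound.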

As for the approach presented in the question, there are several issues with it. It appears to be an attempt to do as I first described -- reading the file to the end to determine its size -- but

  1. It seems to assume that each read() will start at the beginning of the file, or perhaps that read() will fail if it cannot read the full file. Neither is the case. Each read() will start at the file's current position, and will leave the file positioned after the last byte transferred.

  2. Because it changes the file position, your approach will require the file to be rewound after -- via lseek(), for example. But if lseek() can be used for that purpose (and note well my previous comments with respect to files in which you cannot seek), then it would provide a much cleaner approach to measuring the file's size.

  3. You do not account for I/O errors. If one occurred then it would probably send your program into an infinite loop.

  4. Dynamic allocation is comparatively expensive, and you're doing a whole lot of it. If you want to implement the pre-reading strategy, then this would be a better implementation:

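    #include <sys/types.h>
    #include <unistd.h>

    // Counts the bytes from fd's current position to end of file.
    // Returns the count, or -1 if read() fails.
    ssize_t count_bytes(int fd) {
        ssize_t num_bytes = 0;
        char buffer[2048];
        ssize_t result;

        do {
            result = read(fd, buffer, sizeof(buffer));
            if (result < 0) {
                // handle error ... (here: report failure to the caller)
                return -1;
            }
            num_bytes += result;
        } while (result > 0);

        return num_bytes;
    }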
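
For comparison, the lseek() approach mentioned in point 2 might look like the sketch below. The question does not say whether lseek() is permitted under the project's rules, so treat this as an assumption; size_by_lseek is a hypothetical name. It only works on seekable files and fails on pipes, sockets, and terminals.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the size of a seekable file, or -1 if
 * fd cannot be seeked (pipe, socket, terminal) or another error occurs.
 * The original file position is restored before returning. */
off_t size_by_lseek(int fd) {
    off_t start = lseek(fd, 0, SEEK_CUR);   /* remember current position */
    if (start == (off_t)-1)
        return -1;                          /* e.g. errno == ESPIPE for a pipe */
    off_t end = lseek(fd, 0, SEEK_END);     /* offset of the end == size in bytes */
    if (end == (off_t)-1)
        return -1;
    if (lseek(fd, start, SEEK_SET) == (off_t)-1)
        return -1;                          /* could not restore the position */
    return end;
}
```

A caller could try this first and fall back to a read-until-EOF strategy when it returns -1.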
    

Upvotes: 7

Compile your program with all warnings and debug info (gcc -Wall -Wextra -g with GCC), then run the executable under the gdb debugger or strace(1). Carefully read the documentation of read(2) and of every function you are using (including malloc(3), whose failure you forgot to test).

You need to use the result of read(2), which is the number of bytes actually read, and you need to handle the error case (when read returns -1) separately.

What is probably happening, with a long enough file, is that on the first iteration you read 1 byte, on the second you read 2 bytes, on the third you read 3 bytes, and so on (and you forgot to compute 1+2+3 in that case).

You should accumulate the sum of all the read_output values, and you should handle the case where read(2) returns less than size (this should happen on the last read that returns a non-zero count).

I would instead suggest using a buffer of fixed, constant size and calling read(2) repeatedly, taking care to use the returned byte count (and to handle errors and the EOF condition).

Be aware that system calls (listed in syscalls(2)) are quite expensive. As a rule of thumb, you should read(2) or write(2) a buffer of several kilobytes (and carefully handle the returned byte count, also testing it for errors; see errno(3)). A program that reads only a few bytes per call is inefficient.
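
As a rough illustration of that advice, the sketch below processes input through one fixed buffer of a few kilobytes and never needs the file size. The function name count_newlines and the 4096-byte buffer are hypothetical choices; counting newlines stands in for whatever per-chunk processing the script actually needs.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical example of processing data as it arrives: counts '\n'
 * bytes on fd using one fixed 4 KiB buffer. Returns the count, or -1
 * on a read error. */
long count_newlines(int fd) {
    char buffer[4096];
    long lines = 0;

    for (;;) {
        ssize_t n = read(fd, buffer, sizeof buffer);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;                  /* real error; errno says which */
        }
        if (n == 0)
            break;                      /* end of file */
        for (ssize_t i = 0; i < n; i++) /* only the n bytes actually read */
            if (buffer[i] == '\n')
                lines++;
    }
    return lines;
}
```

Note that the inner loop runs over n, the count read(2) returned, not over the full buffer size, since the last chunk is usually shorter.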

Also, malloc (or realloc) is quite expensive. Growing the heap-allocated size by one is ugly (you call malloc on every iteration, and in your case you don't even need malloc). You'd do better to use a geometric progression, perhaps newsize = 4*oldsize/3 + 10; (or similar).
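
To get a feel for why a geometric progression helps, here is a small sketch that applies the growth rule above and counts how many times a buffer would have to grow. The helper names next_capacity and growth_steps are hypothetical, and the starting capacity of 0 is just an assumption for the illustration.

```c
#include <stddef.h>

/* Growth rule from the text: newsize = 4*oldsize/3 + 10. */
size_t next_capacity(size_t old) {
    return 4 * old / 3 + 10;
}

/* Counts how many times a buffer starting at capacity 0 must grow
 * before it can hold `target` bytes. */
size_t growth_steps(size_t target) {
    size_t cap = 0, steps = 0;
    while (cap < target) {
        cap = next_capacity(cap);
        steps++;
    }
    return steps;
}
```

With this rule the capacity goes 10, 23, 40, 63, 94, 135, so holding 100 bytes takes 6 growth steps, whereas growing by one byte at a time would take 100.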

Upvotes: 2
