nitowa
nitowa

Reputation: 1117

C read file(s) line-by-line into array of Strings and sort

So I want to create a basic C application mysort that takes a list of files, reads each of them line by line into a buffer and sorts the lines alphabetically. The code looks something likes this (plus parameter parsing, etc):

//How do I initialize an array of 1024byte-Strings with an unknown amount of fields?
char** lines; 
int lineNum = 0;

for(int num_files = j; num_files < argc; num_files++){ //iterate through all files
  FILE * filepointer ;
  char * line = NULL;
  size_t len = 0;
  ssize_t read;

  filepointer = fopen(argv[num_files], "r");    
  if (filepointer == NULL)
    exit(EXIT_FAILURE);

  //TODO: write each line into a new spot of the array, this try doesn't work!

  while ((read = getline(&line, &len, filepointer)) != -1) { 
    //the lines may be assumed to be a max of 1024 bytes
    lines[lineNum] = malloc(1024 * sizeof(char)); 
    //lines[lineNum] = line;
    strcpy(lines[lineNum], line);
    lineNum++;
  }

  fclose(fp);
  if (line)
    free(line);

  //These values might be wrong, but that isn't the issue I'm adressing
  //just for illustration
  qsort(lines , argc - 1, sizeof(char *), cmpstringp) 

  //do something with the sorted lines
}

Since I have to use qsort(3), I need to produce a char** holding all the lines at some point.

What's a good way to accomplish such a task? Do I need my own data structure in order to dynamically store several identical objects?

The lines char** Array isn't initialized here, so the program doesn't work. But since the number of lines is completely unknown at the start of the program it may not be explicitly defined (Unless you know a clever function to figure this out)

The only ways I figured out so far is defining my own dynamic datastructure (e.g. LinkedList) or to parse all files twice in order to determine the number of lines that will be produced.

Both seem very un-elegant to me, but maybe I'm just not accustomed to C code.

Upvotes: 2

Views: 2252

Answers (2)

John Bollinger
John Bollinger

Reputation: 181907

//How do I initialize an array of 1024byte-Strings with an unknown amount of fields?

Obviously, you don't. If you initialize something, then at that point you know all the details of that thing.

I suppose you're asking how to reserve memory for an unknown number of string pointers, but again, you don't. Moreover, note that the 1024-byte restriction is unnecessary for an array of char * such as you propose; it would be relevant only if you intended to structure the data as a 2D array of char. After you have read a string, you know how much space it requires, so for example, I observe that this code ...

    //the lines may be assumed to be a max of 1024 bytes
    lines[lineNum] = malloc(1024 * sizeof(char)); 
    //lines[lineNum] = line;
    strcpy(lines[lineNum], line);

... would be both simpler and without inherent size limit if it were written as:

    lines[linenum] = strdup(line);

In fact, that would use less space, too, in the event that your lines average fewer than 1023 characters.

With respect to space for the overall array, what you can do is reserve memory in increments as you go. That could mean initially malloc()ing space for several strings, and realloc()ing to get more space when needed. It could also mean initially reading the strings into a linked list of either individual strings or fixed-size arrays of strings, and then building your monolithic array after you know how many strings there are.

The linked list alternative transiently requires twice as much storage for the string pointers, but that's not too bad because the string contents do not need to be duplicated. This has the advantage of relatively low memory allocation cost relative to some naive implementations of the malloc() / realloc() approach.

Because reallocation usually requires copying all the data (in this case, the pointers) from the one block to a new, larger one, you generally want to limit the number of reallocations. The usual strategy for this in a case such as yours is to step up the allocation sizes geometrically instead of linearly. That is, each time you find you need more space, you allocate new space sufficient for, say, twice as many strings as you already have. The total cost for that scales linearly in the number of data. Although it may seem wasteful in the event that it turns out you needed only a little more space, it still doesn't require any more space than a linked list + transformation to dynamic array would require.

Upvotes: 1

Rorsch
Rorsch

Reputation: 103

Two ways i see of solving the problem:

1) Go through the file, counting the number of new line characters(and saving it into nl_count), then you can allocate lines like this.

int nl_count = 0;
int c;

while ((c = fgetc(fp)) != EOF)
   if (c == '\n')
      nl_count++;
...
lines = malloc(nl_count * sizeof(char *));


This way you will have to cover some special cases in your cmpstringp function, cause u may get some lines which only contain '\n'.
(edit1. Actually in either case you will have to check for this special case.)
(edit2. You can get off by one bug, cause last line doesn't have to end with '\n'.)

2) Set some base size for lines and reallocate for more space when the actual number of lines read reaches this base size.

#define BASE_SIZE 32
#define GROW_STEP 2

int size;

size = BASE_SIZE
lines = malloc(size * sizeof(char *));

lines_read = 0;
while ((read = getline(&line, &len, fp)) != -1) { 
   lines_read++;
   if (lines_read > size) {
       size *= GROW_STEP;
       lines = realloc (lines, size * sizeof (char *));
   }
   lines[lineNum] = strdup(line);
   lineNum++;
}

Notice that in worst case you will allocate twice as much space than it is really needed.
Also, you should free memory allocated if u use strdup().

...
for (i = 0; i < lines_read; i++)
    free(lines[i]);

Upvotes: 2

Related Questions