JJ tom
JJ tom

Reputation: 13

Creating a million (or more) string array in C

I'm running into a problem creating an array big enough to store words from a large text document (think books).

Usually, I would just do:

char wordList[1000000][30];

But, as expected the program crashes as soon as it tries to initialize the array. So, I tried a few different things, such as:

char *wordList[30]
int k=0;
while(k<1000000){
   wordList[k]= malloc(sizeof(char)*30);
   k++;
}

This, too, didn't work. So I'm wondering if there is an easier way. I know its possible. For the first option, my research as lead my to believe the the variable is initialized on the stack (which has small memory) and segfaults.

I'm not sure why the second one fails. Any suggestions? I've searched everywhere I could to find an answer, but most of the suggestions are in java or c++ where you just call new, or arraylist etc.

Upvotes: 0

Views: 2631

Answers (2)

In practice, a good rule of thumb when programming in C is that "big" data should be allocated in heap memory.

(I am guessing that you are coding for an ordinary laptop or desktop machine running some common operating system; I'm actually thinking of a desktop Linux computer, like the machine I am answering here; but you could adapt my answer to a desktop running some Windows, to a tablet running some Android, to an Apple computer running some MacOSX)

Notice that your current wordList is 30 megabytes (that is sizeof(wordList)==30000000). That is big! But in some cases not big enough (probably not enough for the whole Saint James bible, and certainly not for the entire laws, decrees, and juridictions of US or of France, or for the archive of a century-old newspaper). You can easily find textual data of more than several dozens of megabytes today, such as all the messages on StackOverflow.

You may need to understand more about current operating systems to understand all of my answer; I recommend reading Operating Systems: Three Easy Pieces and, if your computer runs Linux and you want to code for Linux, Advanced Linux Programming.

You don't want to allocate a big array (or any kind of big data) as local variables (or with alloca(3) ...) on your call stack, because the call stack is limited (typically to one megabyte, or a few of them). On some special computers (think of expensive servers running some specially configured Linux) you could raise that (machine call stack) limit, but perhaps not easily, to several dozens of megabytes. Expecting a gigabyte call stack is not reasonable.

You probably don't want to have huge global or static data (those allocated at compile time in your data segment) of fixed size. If you did that, your program might still lack of memory (because you under-estimated that fixed size) or could not even start on smaller computers (e.g. if your data segment had 20Gbytes, your executable might start on my desktop with 32Gbytes, but would fail to start -at execve(2) time- on your laptop with only 16Gbytes).

The remaining option is usual practice: allocate all "big" data in heap memory by indirectly using primitives growing your virtual address space. In standard C, you'll extensively use malloc and friends (e.g. calloc) - with free to release memory. FWIW, the underlying primitives (to grow the virtual address space) on Linux include mmap(2) and related system calls (and may be called by the malloc implementation on your system). But the standard C dynamic memory allocation techniques (that is malloc & free) are hiding these gory (implementation specific) details in the C standard library (so your code using malloc & free could become portable with efforts from your part).

Read some coding rules, e.g. the GNU ones, related to memory usage, and robust programs, notably:

Avoid arbitrary limits on the length or number of any data structure, including file names, lines, files, and symbols, by allocating all data structures dynamically

(emphasis is mine)

Practically speaking, your wordList (that is a poor name, it is not a list but a vector or a table) should probably be a dynamically allocated array of pointers to dynamically allocated strings. You could declare it as char**wordList; and you want to keep its allocated size and used length (perhaps in two other global variables, size_t allocatedSize, usedLength; ...). You might prefer to use a struct ending with a flexible array member

Don't forget to check against failure of malloc. Perhaps you want to initialize your data with something like:

allocatedSize=1000;
usedLength=0;
wordList= calloc(allocatedSize, sizeof(char*));
if (!wordList) { perror("initial calloc wordlist"); exit(EXIT_FAILURE); };

Here is a routine to add a new word to your wordList; that routine does not check if the word is indeed new (maybe you want to use some other data structure, like some hash-table, or some self balancing binary search tree); if you want to keep only unique words, read some Introduction to Algorithms. Otherwise, you could use:

void add_new_word(const char*w) {
   if (usedLength >= allocatedSize) {
     size_t newsize = 4*usedLength/3+10;

(heuristically, we don't want to re-allocate wordList too often; hence the "geometrical" growth above)

     char**newlist = calloc(newsize*sizeof(char*));
     if (!newlist) { perror("calloc newlist"); exit(FAILURE); };
     memcpy (newlist, wordList, usedLength*sizeof(char*));
     free (wordList);
     wordList = newlist;
     allocatedSize = newsize;
  };
  // here we are sure that wordList is not full, 
  // so usedLength < allocatedSize
  char *dw = strdup(w);
  if (!dw) { perror("strdup failure"); exit(EXIT_FAILURE); };

we are using the very common strdup(3) function to copy some string into a heap allocated one. If your system don't have that, it is really easy to write, using strlen, malloc, strcpy ...

  wordList[usedLength++] = dw;
} // end of add_new_word

After a call to add_new_word you know that a new word has been added at index usedLength-1 of wordList.

Notice that my add_new_word is checking against failure of malloc (including the one called from strdup) and satisfy the "robustness" criteria: all data is heap allocated!

BTW, some computers are (IMHO wrongly) enabling memory overcommitment. This is a system administration issue. I dislike that feature (because when it is enabled, malloc would never fail, but programs would crash badly when memory resources are exhausted).

FWIW, here is the routine to release memory, to be called near end of program (or registered thru atexit(3))

void destroy_word_list(void) {
  if (!wordList) return;
  for (size_t ix=0; ix<usedLength; ix++) free(wordList[ix]);
  free (wordList);
  usedLength = 0;
  allocatedSize = 0;
  wordList = NULL;
} // end of destroy_word_list

You may want to use valgrind e.g. to debug memory leaks.

Upvotes: 5

RoadRunner
RoadRunner

Reputation: 26315

wordList is an array of 30 char* pointers. You are accessing way beyond the limit of this array. Specifically, you are accessing up to million spaces, but the array only has 30 spaces. This will cause undefined behaviour.

You need to instead make sure wordList has enough space for 1000000 pointers.

This should be instead:

char *wordList[1000000];

This allows flexibility on the length of the words. The only fixed size here is the array size.

If you use:

wordList[k]= malloc(sizeof(char)*30);

Moreover, this will run into issues if the words are more than 29 characters, excluding the \0 character at the end. Although, their are not many words longer than 29 characters. Words as long as:

supercalifragilisticexpialidocious

Are hard to come by.

Furthermore, it depends on how you are reading these words from your text document. If you parse the words, and instead use a temporary buffer to store them, then you can do this:

wordList[k]= malloc(strlen(word)+1); /* +1 for null-terminator */

Which will allocate memory for any sized word you copy into wordList[k]. This will be more efficient for smaller words like "the" and "or", instead of allocating 30 spaces for any word.

Note: Allocating a million pointers on the heap beforehand is also very wasteful, this process should be done on an as needed basis. It might even be better to use char **wordList, to allow more flexibility with how many words you allocate.

For example, you could allocate a starting size:

size_t start_size = 1000;
char **wordList = malloc(start_size * sizeof(*wordlist));

Then if more words are found, you can realloc() more space as needed.

realloc() resizes block of memory it points to, and returns a pointer.

An example would be:

if (start_size == word_count) {
    start_size *= 2;
    wordList = realloc(wordList, start_size * sizeof(*wordList));
    if (wordList == NULL) {
        /* handle exit */

Which would return a pointer which holds start_size * 2 spaces.

You should also check the return of malloc() and realloc(), as they can return NULL if unsuccessful. At the end of your program, you should also free() the pointers allocated from malloc().

Upvotes: 6

Related Questions