polu_
polu_

Reputation: 433

Invalid uninitialized jump or move memory error while trying to split a char32_t string into tokens manually

I am trying to split a char32_t string into tokens separated by a delimiter. I am not using any strtok or other std library function because, it is gurrented that input string and the delimiter will be mulltibyte unicode string.

Here is the function I have written:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

char32_t **sp(char32_t *str, char32_t delim, int *len) {
  *len = 1;
  char32_t *s = str;
  while (*s != U'\0') {
    if (*s == delim) {
      (*len)++;
    }
    s++;
  }
  char32_t **tokens = (char32_t **)malloc((*len) * sizeof(char32_t *));
  if (tokens == NULL) {
    exit(111);
  }

  char32_t * p = str;
  int i = 0;
  while (*p != U'\0') {
    int tok_len = 0;
    while (p[tok_len] != U'\0' && p[tok_len] != delim) {
      tok_len++;
    }
    tokens[i] = (char32_t *)malloc(sizeof(char32_t) * (tok_len + 1));
    if (tokens[i] == NULL) {
      exit(112);
    }
    memcpy(tokens[i], p, tok_len * sizeof(char32_t));
    tokens[i][tok_len] = U'\0';
    p += tok_len + 1;
    i++;
  }
  return tokens;
}

And here is the driver code

int main() {
  char32_t *str = U"Hello,World,mango,hey,";
  char32_t delim = U',';
  int len = 0;
  char32_t ** tokens = sp(str, delim, &len);
  wprintf(L"len -> %d\n", len);
  for (int i = 0; i < len; i++) {
    if (tokens[i]) {
    
    wprintf(L"[%d] %ls\n" , i , tokens[i]);  
    }
    free(tokens[i]);
  }  
  free(tokens);

}

Here is the output:

len -> 5
[0] Hello
[1] World
[2] mango
[3] hey
[4] (null)

But when I check the program with valgrind it show multiple memory errors

valgrind -s --leak-check=full --track-origins=yes ./x3
==7703== Memcheck, a memory error detector
==7703== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==7703== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==7703== Command: ./x3
==7703== 
tok -> 5
tok -> 5
tok -> 5
tok -> 3
len -> 5
[0] Hello
[1] World
[2] mango
[3] hey
==7703== Conditional jump or move depends on uninitialised value(s)
==7703==    at 0x48FDAF8: __wprintf_buffer (vfprintf-process-arg.c:396)
==7703==    by 0x48FF421: __vfwprintf_internal (vfprintf-internal.c:1459)
==7703==    by 0x490CFAE: wprintf (wprintf.c:32)
==7703==    by 0x1093C9: main (main.c:51)
==7703==  Uninitialised value was created by a heap allocation
==7703==    at 0x4841888: malloc (vg_replace_malloc.c:393)
==7703==    by 0x1091FC: sp (main.c:17)
==7703==    by 0x109384: main (main.c:47)
==7703== 
[4] (null)
==7703== Conditional jump or move depends on uninitialised value(s)
==7703==    at 0x4844225: free (vg_replace_malloc.c:884)
==7703==    by 0x1093DA: main (main.c:52)
==7703==  Uninitialised value was created by a heap allocation
==7703==    at 0x4841888: malloc (vg_replace_malloc.c:393)
==7703==    by 0x1091FC: sp (main.c:17)
==7703==    by 0x109384: main (main.c:47)
==7703== 
==7703== 
==7703== HEAP SUMMARY:
==7703==     in use at exit: 0 bytes in 0 blocks
==7703==   total heap usage: 7 allocs, 7 frees, 5,248 bytes allocated
==7703== 
==7703== All heap blocks were freed -- no leaks are possible
==7703== 
==7703== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
==7703== 
==7703== 1 errors in context 1 of 2:
==7703== Conditional jump or move depends on uninitialised value(s)
==7703==    at 0x4844225: free (vg_replace_malloc.c:884)
==7703==    by 0x1093DA: main (main.c:52)
==7703==  Uninitialised value was created by a heap allocation
==7703==    at 0x4841888: malloc (vg_replace_malloc.c:393)
==7703==    by 0x1091FC: sp (main.c:17)
==7703==    by 0x109384: main (main.c:47)
==7703== 
==7703== 
==7703== 1 errors in context 2 of 2:
==7703== Conditional jump or move depends on uninitialised value(s)
==7703==    at 0x48FDAF8: __wprintf_buffer (vfprintf-process-arg.c:396)
==7703==    by 0x48FF421: __vfwprintf_internal (vfprintf-internal.c:1459)
==7703==    by 0x490CFAE: wprintf (wprintf.c:32)
==7703==    by 0x1093C9: main (main.c:51)
==7703==  Uninitialised value was created by a heap allocation
==7703==    at 0x4841888: malloc (vg_replace_malloc.c:393)
==7703==    by 0x1091FC: sp (main.c:17)
==7703==    by 0x109384: main (main.c:47)
==7703== 
==7703== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

I am unable to figure out what is the problem. any help will be appreciated

I have also tried with unicode strings the same error also occurs.

Upvotes: 1

Views: 99

Answers (1)

H.S.
H.S.

Reputation: 12679

valgrind is giving those errors because your program is accessing uninitialised memory in last iteration of this for loop in main() function (i.e. while accessing tokens[4], when len value is 5):

  for (int i = 0; i < len; i++) {
     if (tokens[i]) {
        wprintf(L"[%d] %ls\n" , i , tokens[i]);  
     }
     free(tokens[i]);
  }  

malloc function allocate memory and leave it uninitialised. Here, in sp() function, when your program allocating memory it is uninitialised:

  char32_t **tokens = (char32_t **)malloc((*len) * sizeof(char32_t *));

The while loop of sp() function allocate and copy some value to allocated memory for all the members of tokens array except the last member and leaves it uninitialised. In the main(), your program is accessing that uninitialised member and hence the valgrind reporting the error.

To fix the problem, in sp() function, after allocating memory to tokens -
Either make last pointer member of tokens array NULL:

// this is the bare minimum change required to fix the problem
tokens [*len - 1] = NULL;

Or, make all pointers NULL

for (int i = 0; i < *len; ++i) {
   tokens[i] = NULL;
}

Or, use calloc to allocate memory to tokens, which will ensure all the allocated pointers initialised to NULL:

char32_t **tokens = calloc((*len), sizeof(char32_t *));

With any of the above mentioned solutions, valgrind output:

# valgrind -s --leak-check=full --track-origins=yes ./a.out 
==9761== Memcheck, a memory error detector
==9761== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9761== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==9761== Command: ./a.out
==9761== 
len -> 5
[0] Hello
[1] World
[2] mango
[3] hey
==9761== 
==9761== HEAP SUMMARY:
==9761==     in use at exit: 0 bytes in 0 blocks
==9761==   total heap usage: 7 allocs, 7 frees, 5,248 bytes allocated
==9761== 
==9761== All heap blocks were freed -- no leaks are possible
==9761== 
==9761== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Found one more problem in your code, when the input string does not have delimiter character as the last character, you program end up accessing input string beyond it's length which results in undefined behaviour. Look at this statement of while loop of sp() function:

p += tok_len + 1;

Assume input string is - U"Hello,World,mango,hey" [note the last character of string is not delimiter ,]. The nested while loop condition will result in false when the p[tok_len] equal to U'\0' while iterating input string and the below statement p += tok_len + 1; will make the pointer p pointing to memory just beyond the input string. The outer while loop condition attempt to dereference the p and it will lead to undefined behaviour.

Replace this statement of while loop of sp() function:

    p += tok_len + 1;

with this

    p += tok_len;
    p += (*p != '\0') ? 1 : 0;

This will first make the pointer p pointing to one character past the end of current tokens in the input string and if that character is not null terminating character then only 1 will be added to pointer p, otherwise not.

The while loop body can be implemented in a much better way and can also be equipped to handle scenarios like, for e.g., taking care of spaces when the words in the input string have space(s) in between them, input string with only delimiters etc. I am leaving it up to you to improve the implementation and to take care of other scenarios.


EDIT:

This is your requirement - if the last character of input string is delimiter then the last member of tokens array should point to empty string, instead of being NULL.
You don't need to handle this as a special scenario after the loop, as you have shown in comment. You can handle this in loop body which is processing the input string and extracting the tokens from it, like this:

char32_t **sp(const char32_t *str, const char32_t delim, int *len) {
    *len = 1;
    for (int i = 0; str[i] != U'\0'; ++i) {
        if (str[i] == delim) (*len)++;
    }

    char32_t **tokens = malloc ((*len) * sizeof (char32_t *));
    if (tokens == NULL) {
        exit(111);
    }

    int start = 0, end = 0, i = 0;
    do {
        if ((str[end] == delim) || (str[end] == U'\0')) {
            tokens[i] = malloc (sizeof (char32_t) * (end - start + 1));
            if (tokens[i] == NULL) {
                exit(112);
            }
            memcpy (tokens[i], &str[start], sizeof (char32_t) * (end - start));
            tokens[i][end - start] = U'\0';
            start = end + 1; i++;
        }
    } while (str[end++] != U'\0');

    return tokens;
}

Few test cases:

Input string:

char32_t *str = U"Hello,World,mango,hey,";

Output:

# ./a.out 
len -> 5
[0] Hello
[1] World
[2] mango
[3] hey
[4] 

Input string:

char32_t *str = U"Hello,World,mango,hey";

Output:

# ./a.out 
len -> 4
[0] Hello
[1] World
[2] mango
[3] hey

Input string:

char32_t *str = U",,, , u";

Output:

# ./a.out 
len -> 5
[0] 
[1] 
[2] 
[3]  
[4]  u

Input string:

char32_t *str = U" ";

Output:

# ./a.out 
len -> 1
[0]  

Upvotes: 3

Related Questions