Zeno Raiser

Reputation: 217

C: extract words from a .txt file, excluding spaces and punctuation

I'm trying to extract the words from a .txt file that contains the following text:

Quando avevo cinqve anni, mia made mi perpeteva sempre che la felicita e la chiave della vita. Quando andai a squola mi domandrono come vuolessi essere da grande. Io scrissi: selice. Mi dissero che non avevo capito il corpito, e io dissi loro che non avevano capito la wita.

The problem is that the array I use to store the words also ends up containing empty words, which always come right after one of the characters ',', '.' or ':'.

I know that "empty words" or "empty chars" don't really make sense, but please try the code with the text above and you'll see what I mean.

Meanwhile, I'm trying to understand how to use sscanf with a scanset such as sscanf(buffer, "%[^.,:]"); which should let me store strings while ignoring the '.', ',' and ':' characters. However, I don't know what I should write inside %[^] to also ignore the space character ' ', which always gets saved.

The code is the following

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

static void load_array(const char* file_name){
  char buffer[2048];
  char a[100][100];
  int buf_size = 2048;
  FILE *fp;
  int j = 0, c = 0;

  printf("\nLoading data from file...\n");

  fp = fopen(file_name,"r"); 

  if(fp == NULL){
    fprintf(stderr,"main: unable to open the file");
    exit(EXIT_FAILURE);
  }

  fgets(buffer,buf_size,fp);

  //here i store each word in an array of strings when I encounter 
  //an unwanted char I save the word into the next element of the 
  //array    
  for(int i = 0; i < strlen(buffer); i++) {    

    if((buffer[i] >= 'a' && buffer[i] <= 'z') || (buffer[i] >= 'A' && buffer[i] <= 'Z')) {
        a[j][c++] = buffer[i];  
    } else {
        j++;
        c = 0;
        continue;
    }
  }

  //this print is used only to see the words in the array of strings
  for(int i = 0; i < 100; i++) 
    printf("%s  %d\n", a[i], i);

  fclose(fp);
  printf("\nData loaded\n");
}

//Here I pass the file_name from command line
int main(int argc, char const *argv[]) {
  if(argc < 2) {
    printf("Usage: ordered_array_main <file_name>\n");
    exit(EXIT_FAILURE);
  }

  load_array(argv[1]);

}

I know that I should store only as many words as are actually needed instead of 100 every time. I'll deal with that later; right now I want to fix the issue with the empty words.

Compilation and execution

gcc -o testloadfile testloadfile.c

./testloadfile "correctme.txt"

Upvotes: 0

Views: 1782

Answers (2)

AndersK

Reputation: 36082

You could use strtok instead:

fgets(buffer,buf_size,fp);
for (char* tok = strtok(buffer,".,: \n"); tok != NULL; tok = strtok(NULL,".,: \n"))
{
   printf("%s\n", tok);
}

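Note that if you want to store what strtok returns, you need to copy the contents of what tok points to, either into your own array or into a freshly allocated copy (strdup, or malloc + strcpy). strtok does not make a copy: it writes '\0' over the delimiters in the buffer you pass it and returns pointers into that same buffer, so the tokens are lost as soon as the buffer is reused, for example by the next fgets call.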
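As a rough sketch of the copying step described above, the helper below copies each token into a fixed-size array like the one in the question. The name store_tokens, the MAX_WORDS and MAX_LEN macros, and the exact delimiter set are assumptions made for this illustration, not part of the original code.

```c
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 100   /* assumed limits, matching char a[100][100] */
#define MAX_LEN   100

/* Splits `line` in place with strtok and copies every token into words[].
 * Returns the number of words stored. Hypothetical helper. */
size_t store_tokens(char *line, char words[][MAX_LEN], size_t max_words)
{
    const char *delims = " \t\r\n.,:;!?";   /* assumed delimiter set */
    size_t count = 0;

    for (char *tok = strtok(line, delims);
         tok != NULL && count < max_words;
         tok = strtok(NULL, delims)) {
        /* tok points into `line`, so copy it before `line` is reused.
         * snprintf truncates tokens longer than MAX_LEN - 1 characters. */
        snprintf(words[count], MAX_LEN, "%s", tok);
        count++;
    }
    return count;
}
```

With the buffer filled by fgets, a call such as store_tokens(buffer, a, 100) would fill a and return how many entries are valid, so the printing loop can stop there instead of always printing 100 rows.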

Upvotes: 1

Tom's

Reputation: 2506

You forgot to add the terminating '\0' to each row of a, and your algorithm has several flaws. For example, you increment j every time a non-letter appears. What if you have ", "? You increment twice instead of once, which leaves an empty entry in the array.
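A minimal sketch of those two fixes applied to the loop from the question, assuming the same 100x100 array. The function name split_letters and the MAX_WORDS / MAX_LEN macros are made up for this illustration, and isalpha stands in for the manual range checks.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_WORDS 100   /* assumed limits, matching char a[100][100] */
#define MAX_LEN   100

/* Copies each run of letters from `buffer` into words[] as a
 * NUL-terminated string and returns the number of words stored. */
size_t split_letters(const char *buffer, char words[][MAX_LEN], size_t max_words)
{
    size_t j = 0;   /* index of the word being filled */
    size_t c = 0;   /* number of letters stored in that word so far */

    for (size_t i = 0; buffer[i] != '\0' && j < max_words; i++) {
        if (isalpha((unsigned char)buffer[i])) {
            if (c < MAX_LEN - 1)          /* leave room for the '\0' */
                words[j][c++] = buffer[i];
        } else if (c > 0) {
            /* First non-letter after a word: terminate it and move on.
             * Later separators see c == 0 and are simply skipped. */
            words[j][c] = '\0';
            j++;
            c = 0;
        }
    }
    if (c > 0 && j < max_words) {         /* word that runs to the end of the buffer */
        words[j][c] = '\0';
        j++;
    }
    return j;
}
```

The return value tells the caller how many rows are valid, so only those rows need to be printed.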

One "easy" way is to use strtok, as AndersK showed.

fgets(buffer,buf_size,fp);
for (char* tok = strtok(buffer,".,:"); tok != NULL; tok = strtok(NULL,".,:")) {
   printf("%s\n", tok);
}

The "problem" with that function is that you have to specify every delimiter yourself, so you also have to add ' ' (space), '\t' (tab), '\n' (newline) and so on.

Since you only want "words" in the sense of "contains only letters, lowercase or uppercase", you can do the following:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char line[] = "Hello ! What a beautiful day, isn't it ?";

    char *beginWord = NULL;

    for (size_t i = 0; line[i]; ++i) {
        if (isalpha((unsigned char)line[i])) { // upper or lower letter ==> valid character for a word
            if (!beginWord) {
                // We found the beginning of a word
                beginWord = line + i;
            }
        } else {
            if (beginWord) {
                // We found the end of a word
                char tmp = line[i];
                line[i] = '\0';
                printf("'%s'\n", beginWord);
                line[i] = tmp;
                beginWord = NULL;
            }
        }
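    }

    if (beginWord) {
        // The line ended in the middle of a word: print that last word too
        printf("'%s'\n", beginWord);
    }

    return (0);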
}

Note how "isn't" is split into "isn" and "t", since ' is not an accepted character for a word.

The algorithm is pretty simple: we loop over the string, and if the current character is a valid letter and beginWord == NULL, it is the beginning of a word. If it is not a valid letter and beginWord != NULL, it is the end of a word. This way there can be any number of non-letter characters between two words and each word is still detected cleanly.
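The same begin/end detection can also be sketched without writing '\0' into the line, by returning a pointer to the start of each word together with its length. The helpers next_word and print_words are hypothetical names chosen for this illustration, and this is only one possible variation on the loop above.

```c
#include <ctype.h>
#include <stdio.h>

/* Finds the next run of letters at or after `s`.
 * Returns a pointer to its first letter and stores its length in *len,
 * or returns NULL when no letters remain. Hypothetical helper. */
const char *next_word(const char *s, size_t *len)
{
    while (*s != '\0' && !isalpha((unsigned char)*s))
        s++;                        /* skip any number of separators */
    if (*s == '\0')
        return NULL;

    const char *begin = s;          /* beginning of a word */
    while (isalpha((unsigned char)*s))
        s++;                        /* stops at a separator or at '\0' */
    *len = (size_t)(s - begin);
    return begin;
}

/* Prints every word of `line` in quotes, one per output line,
 * and returns how many words were found. */
size_t print_words(const char *line)
{
    size_t count = 0, len = 0;

    for (const char *w = next_word(line, &len); w != NULL;
         w = next_word(w + len, &len)) {
        printf("'%.*s'\n", (int)len, w);   /* print exactly len characters */
        count++;
    }
    return count;
}
```

Because the inner loop also stops at the terminating '\0', a word that ends exactly at the end of the string is reported like any other.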

Upvotes: 0
