Reputation: 522

How to count the frequency of the two first letters in a word from a dictionary?

I have a 143k lowcase word dictionary and I want to count the frequency of the first two letters (ie: aa* = 14, ab* = 534, ac = 714 ... za = 65, ... zz = 0 ) and put it in a bidimensional array.

However I have no idea how to even go about iterating them without switches or a bunch of if elses I tried looking on google for a solution to this but I could only find counting amount of letters in the whole word and mostly only things in python.

I've sat here for a while thinking how could I do this and my brain keeps blocking this was what I came up with but I really don't know where to head.

int main (void) {
    char *line = NULL;
    size_t len = 0;
    ssize_t read;
    char *arr[143091];
    
    FILE *fp = fopen("large", “r”);
    if (*fp == NULL)
    {
        return 1;
    }
    
    int i = 0;
    while ((read = getline(&line, &len, fp)) != -1)
    {
        arr[i] = line;
        i++;
    }
    
    char c1 = 'a';
    char c2 = 'a';
    i = 0;
    int j = 0;
    while (c1 <= 'z')
    {
        while (arr[k][0] == c1)
        {
            while (arr[k][1] == c2)
            {
                
            }
            c2++;
        }
        c1++;
    }
    fclose(fp);
    if (line)
        free(line);
    return 0;
}

Am I being an idiot or am I just missing someting really basic? How can I go about this problem?

Edit: I forgot to mention that the dictionary is only lowercase and has some edge cases like just an a or an e and some words have ' (like e'erand e's) there are no accentuated latin characters and they are all accii lowercase

Upvotes: 2

Answers (6)

Bodo

Reputation: 9875

The code assumes that the input has one word per line without leading spaces and will count all words that start with two ASCII letters from 'a'..'z'. As the statement in the question is not fully clear, I further assume that the character encoding is ASCII or at least ASCII compatible. (The question states: "there are no accentuated latin characters and they are all accii lowercase")

If you want to include words that consist of only one letter or words that contain ', the calculation of the index values from the characters would be a bit more complicated. In this case I would add a function to calculate the index from the character value. Also for non-ASCII letters the simple calculation of the array index would not work.

The program reads the input line by line without storing all lines, checks the input as defined above and converts the first two characters from range 'a'..'z' to index values in range 0..'z'-'a' to count the occurrence in a two-dimensional array.

#include <stdio.h>
#include <stdlib.h>

int main (void) {
    char *line = NULL;
    size_t len = 0;
    ssize_t read;

    /* Counter array, initialized with 0. The highest possible index will 
     * be 'z'-'a', so the size in each dimension is 1 more */
    unsigned long count['z'-'a'+1]['z'-'a'+1] = {0};
    
    FILE *fp = fopen("large", "r");
    if (fp == NULL)
    {
        return 1;
    }
    
    while ((read = getline(&line, &len, fp)) != -1)
    {
        /* ignore short input */
        if(read >= 2)
        {
            /* ignore other characters */
            if((line[0] >= 'a') && (line[0] <= 'z') &&
               (line[1] >= 'a') && (line[1] <= 'z'))
            {
                /* convert first 2 characters to array index range and count */
                count[line[0]-'a'][line[1]-'a']++;
            }
        }
    }
    
    fclose(fp);
    if (line)
        free(line);

    /* example output */
    for(int i = 'a'-'a'; i <= 'z'-'a'; i++)
    {
        for(int j = 'a'-'a'; j <= 'z'-'a'; j++)
        {
            /* only print combinations that actually occurred */
            if(count[i][j] > 0)
            {
                printf("%c%c %lu\n", i+'a', j+'a', count[i][j]);
            }
        }
    }

    return 0;
}

The example input

foo
a
foobar
bar
baz
fish
ford

results in

ba 2
fi 1
fo 3

Upvotes: 2

David C. Rankin

Reputation: 84579

You are started in the right direction. You do need a 2D array 27 x 27 for a single case (e.g. lowercase or uppercase), not including digits. To handle digits, just add another 11 x 11 array and map 2-digit frequencies there. The reason you can't use a flat 1D array and map to it without serious indexing gymnastics is that the ASCII sum of "ab" and "ba" would be the same.

The 2D array solves that problem allowing the map of the 1st character ASCII value to the first index, and the map of the ASCII of the 2nd character to the 2nd index or after in that row.

An easy way to think of it is to just take a lowercase example. Let's look at the word "accent". You have your 2D array:

+---+---+---+---+---+---+
| a | a | b | c | d | e | ...
+---+---+---+---+---+---+
| b | a | b | c | d | e | ...
+---+---+---+---+---+---+
| c | a | b | c | d | e | ...
+---+---+---+---+---+---+
...

The first column tracks the first letter and then the remaining columns (the next 'a' - 'z' characters) track the 2nd character that follows the first character. (you can do this will an array of struct holding the 1st char and a 26 char array as well -- up to you) This way, you remove ambiguity of which combination "ab" or "ba".

Now note -- you do not actually need a 27 x 27 arrays with the 1st column repeated. Recall, by mapping the ASCII value to the first index, it designates the first character associated with the row on its own, e.g. row[0][..] indicates the first character was 'a'. So a 26 x 26 array is fine (and the same for digits). So you simply need:

+---+---+---+---+---+
| a | b | c | d | e | ...
+---+---+---+---+---+
| a | b | c | d | e | ...
+---+---+---+---+---+
| a | b | c | d | e | ...
+---+---+---+---+---+
...

So the remainder of the approach is simple. Open the file, read the word into a buffer, validate there is a 1st character (e.g. not the nul-character), then validate the 2nd character (continue to get the next word if either validation fails). Convert both to lowercase (or add the additional arrays if tracking both cases -- that gets ugly). Now just map the ASCII value for each character to an index in the array, e.g.

    int ltrfreq[ALPHABET][ALPHABET] = {{0}};
    ...
    while (fgets (buf, SZBUF, fp)) {    /* read each line into buf */
        int ch1 = *buf, ch2;        /* initialize ch1 with 1st char */
        
        if (!ch1 || !isalpha(ch1))  /* validate 1st char or get next word */
            continue;
        
        ch2 = buf[1];               /* assign 2nd char */
        if (!ch1 || !isalpha(ch2))  /* validate 2nd char or get next word */
            continue;
        
        ch1 = tolower (ch1);        /* convert to lower to eliminate case */
        ch2 = tolower (ch2);
        
        ltrfreq[ch1-'a'][ch2-'a']++;    /* map ASCII to index, increment */
    }

With our example word "accent", that would increment the array element [0][2], so that corresponds to row 0 and column 2 for "ac" in:

+---+---+---+---+---+
| a | b | c | d | e | ...
+---+---+---+---+---+
...       ^ [0][2]

Where you increment the value at that index. So ltrfreq[0][2]++ now holds the value 1 for the combination "ac" having been seen once. When encountered again, the element would be incremented to 2 and so on... Since the value is incremented it is imperative the array be initialized all zero when declared.

When you output the results, you just have to remember to subtract 1 from the j index when mapping from index back to ASCII, e.g.

    for (int i = 0; i < ALPHABET; i++)      /* loop over all 1st char index */
        for (int j = 0; j < ALPHABET; j++)  /* loop over all 2nd char index */
            if (ltrfreq[i][j])  /* map i, j back to ASCII, output freq */
                printf ("%c%c = %d\n", i + 'a', j + 'a', ltrfreq[i][j]);

That's it. Putting it altogether in an example that takes the filename to read as the first argument to the program (or reads from stdin if no argument is given), you would have:

#include <stdio.h>
#include <ctype.h>

#define ALPHABET 26
#define SZBUF  1024

int main (int argc, char **argv) {
    
    char buf[SZBUF] = "";
    int ltrfreq[ALPHABET][ALPHABET] = {{0}};
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }
    
    while (fgets (buf, SZBUF, fp)) {    /* read each line into buf */
        int ch1 = *buf, ch2;        /* initialize ch1 with 1st char */
        
        if (!ch1 || !isalpha(ch1))  /* validate 1st char or get next word */
            continue;
        
        ch2 = buf[1];               /* assign 2nd char */
        if (!ch1 || !isalpha(ch2))  /* validate 2nd char or get next word */
            continue;
        
        ch1 = tolower (ch1);        /* convert to lower to eliminate case */
        ch2 = tolower (ch2);
        
        ltrfreq[ch1-'a'][ch2-'a']++;    /* map ASCII to index, increment */
    }
    
    if (fp != stdin)   /* close file if not stdin */
        fclose (fp);
    
    for (int i = 0; i < ALPHABET; i++)      /* loop over all 1st char index */
        for (int j = 0; j < ALPHABET; j++)  /* loop over all 2nd char index */
            if (ltrfreq[i][j])  /* map i, j back to ASCII, output freq */
                printf ("%c%c = %d\n", i + 'a', j + 'a', ltrfreq[i][j]);
}

Example Input Dictionary

In the file dat/ltrfreq2.txt:

$ cat dat/ltrfreq2.txt
My
dog
has
fleas
and
my
cat
has
none

lucky
cat!

Example Use/Output

$ ./bin/ltrfreq2 dat/ltrfreq2.txt
an = 1
ca = 2
do = 1
fl = 1
ha = 2
lu = 1
my = 2
no = 1

Where both "cat" words accurately account for ca = 2, both "has" for ha = 2 and "My" and "my" for my = 2. The rest are just the 2 character prefixes for words that appear once in the dictionary.

Or with the entire 307993 words dictionary that comes with SuSE, timed to show the efficiency of the approach (all within 15 ms):

$ time ./bin/ltrfreq2 /var/lib/dict/words
aa = 40
ab = 990
ac = 1391
ad = 1032
ae = 338
af = 411
ag = 608
ah = 68
ai = 369
aj = 18
ak = 70
al = 2029
...
zn = 2
zo = 434
zr = 2
zs = 2
zu = 57
zw = 25
zy = 135
zz = 1

real    0m0.015s
user    0m0.015s
sys     0m0.001s

A bit about the array type. Since you have 143K words, that rules out using a short or unsigned short type -- just in case you have a bad dictionary with all 143K words being "aardvark".... The int type is more than capable of handling all words -- even if you have a bad dictionary containing only "aardvark".

Look things over and let me know if this is what you need, if not let me know where I misunderstood. Also, let me know if you have further questions.

Upvotes: 1

Kaz

Reputation: 58617

There is no need to read the entire dictionary into memory, or even to buffer lines. The dictionary consists of words, one per line. This means it has this structure:

"aardvark\nabacus\n"

The first two characters of the file are the first digraph. The other interesting digraphs are all characters which immediately follow a newline.

This can be read by a state machine, which we can code into a loop like this. Suppose f is the FILE * handle to the stream reading from the dictionary file:

for (;;) {
  /* Read two characters from the dictionary file. */
  int ch0 = getc(f);
  int ch1 = getc(f);

  /* Is the ch0 newline? That means we read an empty line,
     and one character after that. So, let us move that character
     into ch0, and read another ch1. Keep doing this until
     ch0 is not a newline, and bail at EOF. */
  while (ch0 == '\n' && ch1 != EOF) {
    ch0 = ch1;
    ch1 = getc(f);
  }

  /* After the above, if we have EOF, we are done: bail the loop */
  if (ch0 == EOF || ch1 == EOF)
    break;
 
  /* We know that ch0 isn't newline. But ch1 could be newline;
     i.e. we found a one-letter-long dictionary entry. We don't
     process those, only two or more letters. */  
  if (ch1 != '\n') {
     /* Here we put the code which looks up the ch0-ch1 pair
        in our frequency table and increments the count. */
  }

  /* Now drop characters until the end of the line. If ch1
     is newline, we are already there. If not, let's just use
     ch1 for reading more characters until we get a newline. */
  while (ch1 != '\n' && ch1 != EOF)
    ch = getc(f);

  /* Watch out for EOF in the middle of a line that isn't
     newline-terminated. */
  if (ch == EOF)
     break;
}

I would do this with a state machine:

enum { begin, have_ch0, scan_eol } state = begin;
int ch0, ch1;

for (;;) {
   int c = getc(f);
   if (c == EOF)
      break;
   switch (state) {
   case begin:
      /* stay in begin state if newline seen */
      if (c != \n') {
        /* otherwise accumulate ch0,
           and switch to have_ch0 state */
        ch0 = c;
        state = have_ch0;
      }
      break;
   case have_ch0:
      if (c == '\n') {
         /* newline in ch0 state: back to begin */
         state = begin;
      } else {
         /* we got a second character! */
         ch1 = c;
         /* code for processing ch0 and ch1 goes here! */
         state = scan_eol; /* switch to scanning for EOL. */
      }
      break;
   case scan_eol:
      if (c == '\n') {
         /* We got the newline we are looking for; go
            to begin state. */
         state = begin;
      }
      break;
   }
}

Now we have a tidy loop around a single call to getc. EOF is checked in one place where we bail out of the loop. The state machine recognizes the situation when we have the first two characters of a line which is at least two characters long; there is a single place in the code where to put the logic for dealing with the two characters.

We are not allocating any buffers; we are not malloc-ing lines, so there is nothing to free. There is no limit on the dictionary size we can scan (just we have to watch for overflowing frequency counters).

Upvotes: 1

chux

Reputation: 154075

How to count the frequency of the two first letters in a word from a dictionary?

Use a simple state machine to read one character at a time, detect when the character is first 2 letters of a word, then increment a 26x26 table. Words do not need to be on seperate lines. Any word length is allowed.

unsigned long long frequency[26][26] = { 0 };  // Set all to 0
FILE *fp = fopen("large", "r");
...
int ch;
// Below 2 objects are the state machine
int word[2];
int word_length = 0;

while ((ch = fgetc(fp)) != EOF) {
  if (isalpha(ch)) {
    if (word_length < 2) {
      word[word_length++] = tolower(ch);
      if (word_length == 2) {  // 2nd letter just arrived
        assert(word[0] >= 'a' && word[0] <= 'z');  // Note 1
        assert(word[1] >= 'a' && word[1] <= 'z');
        frequency[word[0] - 'a'][word[1] - 'a']++;
      }
    }
  } else {
    word_length =  0; // Make ready for a new word
  }
} 

for (int L0 = 'a'; L0 <= 'z'; L0++);
  for (int L1 = 'a'; L1 <= 'z'; L1++);
    unsigned long long sum = frequency[L0 - 'a'][L1 - 'a'];
    if (sum) {
      printf("%c%c %llu\n", L0, L1, sum);  
      ...

Note 1, in locales that have more than a-z letters, like á, é, í, ó, ú, ü, ñ, additional needed. A simple approach is to use a frequency[256][256] - somewhat memory hoggish.

Upvotes: 1

Shawn

Reputation: 52549

The idea is to have a two-dimensional array, each dimension holding one of the first two characters of each line. The clever bit is that in C, even a string whose length as reported by strlen() to be 1 has two char's - the character and the trailing 0 at the end, so you don't need to special-case cases like "a". Its frequency is tracked in counts['a'][0].

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

/* Reads input on stdin, outputs to stdout. Using a multibyte
 * character encoding will likely cause unusual output; don't do
 * that. But it will work with encodings other than ASCII. Also handles
 * mixed-cased input.
 */

int main(void) {
   int *counts[UCHAR_MAX + 1] = { NULL };
   char *line = NULL;
   size_t bufsize = 0;
   ssize_t len;

   // Populate the frequency counts
   while ((len = getline(&line, &bufsize, stdin)) > 0) {
     if (line[len - 1] == '\n') { // Get rid of newline
       line[len - 1] = 0;
     }
     if (line[0] == 0) { // Skip empty lines
       continue;
     }
     unsigned fc = line[0];
     unsigned sc = line[1];
     if (!counts[fc]) { // Allocate the second dimension if needed
       counts[fc] = calloc(UCHAR_MAX + 1, sizeof(int));
     }
     counts[fc][sc] += 1;
   }

   // Print out the frequency table.
   for (int fc = 1; fc <= UCHAR_MAX; fc += 1) {     
     if (!counts[fc]) { // Skip unused first characters
       continue;
     }
     if (counts[fc][0]) { // Single-character line count
       printf("%c\t%d\n", fc, counts[fc][0]);
     }
     for (int sc = 1; sc <= UCHAR_MAX; sc += 1) {
       if (counts[fc][sc]) {
         printf("%c%c\t%d\n", fc, sc, counts[fc][sc]);
       }
     }
   }

   return 0;
}

Example:

$ perl -Ci -ne 'print if /^[[:ascii:]]+$/ && /^[[:lower:]]+$/' /usr/share/dict/american-english-large | ./freqs
a       1
aa      6
ab      483
ac      651
ad      497
ae      112
af      198
ag      235
ah      7
ai      161
etc.

Upvotes: 1

Limina102

Reputation: 1087

Such job is more suitable for languages like Python, Perl, Ruby etc. instead of C. I suggest at least trying C++.

If you don't have to write it in C, here is my Python version: (since you didn't mention it in the question - are you working on an embedded system or something where C/ASM are the only options?)

FILENAME = '/etc/dictionaries-common/words'

with open(FILENAME) as f:
    flattened = [ line[:2] for line in f ]

dic = {
    key: flattened.count(key)
    for key in sorted(frozenset(flattened))
}

for k, v in dic.items():
    print(f'{k} = {v}')

Outputs:

A' = 1
AM = 2
AO = 2
AW = 2
Aa = 6
Ab = 44
Ac = 37
Ad = 68
Ae = 18
Af = 22
Ag = 36
Ah = 12
Ai = 17
Aj = 2
Ak = 14
Al = 284
Am = 91
An = 223
Ap = 44
Aq = 13
Ar = 185
As = 88
At = 56
Au = 81
Av = 28
Ax = 2
... ...

Upvotes: -4

How to count the frequency of the two first letters in a word from a dictionary?

Answers (6)

Related Questions