Reputation: 39
I have a 16 letter alphabet. Given a sentence, I would like to count the frequency of each letter, and then encapsulate all frequencies in one number using clever bit shifting. Lets assume those sentences are always 100 letter each, and assuming no letter occurs more than 31 times, I would like something like this:
A: occurs 2 times -> 0010
B: occurs 10 times -> 1010
C: occurs 7 times -> 0111
Etc.
Now, I would like to concatenation those like this: 001010100111...
I just concentrated the frequencies above. To store the number easily, I wanted to convert the binary above to a 64 bit unsigned int.
My other requirement is to have that long and re extract the frequencies back per letter. So, I will need to be able to generate the decimal then parse it into the individual frequency bits.
How would I do that in c? I can do bit shifting and additions of those frequencies but that means I'm overlapping frequencies. The other issue is when extracting frequencies, how do I know how many bits to shift since trailing 0s are insignificant and not saved in the decimal but they are really important in my algorithm.
Any clever ideas? Thank you.
Upvotes: 2
Views: 684
Reputation: 28278
Existing answers are good; maybe the following is better though.
It is easy to use just one 64-bit number, and increment individual 4-bit parts in it.
For example, the following increases the counter for the 3rd, 5th and 13th letter (counting from 0):
uint64_t my_counters = 0;
my_counters += (uint64_t)1 << (4 * 3);
my_counters += (uint64_t)1 << (4 * 5);
my_counters += (uint64_t)1 << (4 * 13);
If your letters are consecutive in the ASCII table (for example, [a-p]
), it is easy to calculate the index of the letter from its numerical value:
uint64_t my_counters = 0;
size_t i;
for (i = 0; str[i] != '\0'; ++i)
{
int index = str[i] - 'a';
my_counters += (uint64_t)1 << (4 * index);
}
To print:
char c;
for (c = 'a'; c <= 'p'; ++c)
{
int index = c - 'a';
int counter = (int)((my_counters >> (4 * index)) & 0xf);
printf("Letter %c, count %d\n", c, counter);
}
Note: my code concatenates the bits in the opposite order comparing to what you want; it seems that this way makes it more clear. You can reverse the order if you replace 4 * index
by 60 - 4 * index
.
Upvotes: 1
Reputation: 5289
Try this, the advantage is that there will be no need of intermediate array to count your letters:
int ch_to_index(char ch) { return ch-'A'; }
unsigned long long get_freq(unsigned long long freq, int index)
{
return (freq>>(4*index))&0x0f;
}
unsigned long long set_freq(unsigned long long freq, int index, unsigned long val)
{
return ( ((val&0x0fULL)<<(4*index)) | (freq & (0xffffffffffffffffULL ^ (0xfULL<<(4*index)))) );
}
unsigned long long inc_freq(unsigned long long freq, int index)
{
return set_freq(freq, index, get_freq(freq, index) +1) ;
}
int main()
{
int i;
unsigned long long freq=0;
freq = inc_freq(freq, ch_to_index('A'));
freq = inc_freq(freq, ch_to_index('A'));
freq = inc_freq(freq, ch_to_index('B'));
for(i=0;i<16;i++)
{
printf("%i = %i\n", i, (int)get_freq(freq, i));
}
}
Upvotes: 1
Reputation: 134045
You have two problems: a mathematical problem and a coding problem.
Let's ignore the math problem for the moment. You can build an array with 16 integers and count the occurrences of each letter when you scan the text. If you assume that no letter occurs more than 15 times, then you don't have to worry about overflow and you can put the counts into your 64-bit integer easily enough. You'd write:
int counts[16]; // has the counts
unsigned long long freqs; // this holds the encoded value
// after you compute the counts
freqs = 0;
for (int i = 0; i < 16; ++i)
{
freqs <<= 4;
freqs |= (counts[i] & 0xF);
}
At that point, the count for the first letter is in the top 4 bits of freqs
, and the count for the last letter is is the bottom four bits. All the other counts are in between. Each one occupies exactly 4 bits of that 64-bit number.
Now, if you want the ability to do this with much larger text, or a letter can occur more often than 15 times, you have to scale your numbers after counting so that the maximum is no larger than 15. That's the math problem I alluded to. I think you can probably figure out how to handle that one. You just have to scale the numbers.
Upvotes: 5
Reputation: 5488
Something like this should suffice:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
const static int SIZE = 16;
const static char ALPHABET[] = "0123456789ABCDEF";
char* getFrequency(char* str);
uint64_t getFrequencyNumber(char* freq);
int main() {
char* str = "1337CODE";
uint64_t freqNum = getFrequencyNumber(getFrequency(str));
printf("%llu\n",freqNum);
return 0;
}
char* getFrequency(char* str) {
int i,j;
char* freq = (char*) calloc(SIZE, sizeof(char));
for(i=0; str[i]; ++i)
for(j=0; j<SIZE; ++j)
if(str[i] == ALPHABET[j])
if(freq[i] < 15) //ignore overflow
(freq[j])++;
return freq;
}
uint64_t getFrequencyNumber(char* freq) {
uint64_t i,num;
for(i=num=0; i<SIZE; ++i)
num |= freq[i] << (4*i); //use bit shifting to concatenate 4 bit values
return num;
}
Upvotes: 1