paul5345
paul5345

Reputation: 3

C++ How to count number of collisions while using a hash function?

I was assigned this lab in which I needed to create a hash function, and count the number of collisions that occur when hashing a file ranging up to 30000 elements. Here is my code so far

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

long hashcode(string s){
  long seed = 31; 
  long hash = 0;
  for(int i = 0; i < s.length(); i++){
    hash = (hash * seed) + s[i];
  }
  return hash % 10007;
};

int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];

  //File stream
  while(!input.eof()){
    input>>x;
    array[count] = hashcode(x);
    count++;
    for(int i = 0; i<count; i++){
        if(array[i]==hashcode(x)){
            collisions++;
        }
    }
  }
  cout<<"Total Input is " <<count-1<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}

I am just not sure of how to count the number of collisions. I tried storing every hashed value to an array and then search that array, but it resulted in like 12000 collisions when there were only 10000 elements. Any advice at all on how to count the collisions or even if my hash function could use improvement, would be appreciated. Thank you.

Upvotes: 0

Views: 9328

Answers (3)

Richard Chambers
Richard Chambers

Reputation: 17573

For an explanation of hash tables see How does a hash table work?

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Generate a hash code that is in the range of our hash table.
// The range we are using is zero to 10,007 so that our table is
// large enough and the prime number size reduces the probability
// of collisions from different strings hashing to the same value.
unsigned long hashcode(string s){
    unsigned long seed = 31;
    unsigned long hash = 0;
    for (int i = 0; i < s.length(); i++){
        hash = (hash * seed) + s[i];
    }
    // we want to generate a hash code that is the size of our table.
    // so we mod the calculated hash to ensure that it is in the proper range
    // of our hash table entries. 10007 is a prime number which provides
    // better characteristics than a non-prime number table size.
    return hash % 10007; 
};

int main(int argc, char * argv[]){
    int count = 0;
    int collisions = 0;
    fstream input(argv[1]);
    string x;
    int array[30000] = { 0 };

    //File stream
    while (!input.eof()){
        input >> x;     // get the next string to hash
        count++;        // count the number of strings hashed.
        // hash the string and use the hash as an index into our hash table.
        // the hash table is only used to keep a count of how many times a particular
        // hash has been generated. So the table entries are ints that start with zero.
        // If the value is greater than zero then we have a collision.
        // So we use postfix increment to check the existing value while incrementing
        // the hash table entry.
        if ((array[hashcode(x)]++) > 0)
            collisions++;
    }
    cout << "Total Input is " << count << endl;
    cout << "Collision # is " << collisions << endl;

    return 0;
}

Upvotes: 0

Richard Hodges
Richard Hodges

Reputation: 69864

Answer added in the interests of education. It's probably your professor's next lesson.

Almost certainly the most efficient way to detect hash collision is to use a hash set (a.k.a. unordered_set)

#include <iostream>
#include <unordered_set>
#include <fstream>
#include <string>

// your hash algorithm
long hashcode(std::string const &s) {
    long seed = 31;
    long hash = 0;
    for (int i = 0; i < s.length(); i++) {
        hash = (hash * seed) + s[i];
    }
    return hash % 10007;
};

int main(int argc, char **argv) {
    std::ifstream is{argv[1]};
    std::unordered_set<long> seen_before;
    seen_before.reserve(10007);
    std::string buffer;
    int collisions = 0, count = 0;
    while (is >> buffer) {
        ++count;
        auto hash = hashcode(buffer);
        auto i = seen_before.find(hash);
        if (i == seen_before.end()) {
            seen_before.emplace_hint(i, hash);
        }
        else {
            ++collisions;
        }
    }
    std::cout << "Total Input is " << count << std::endl;
    std::cout << "Collision # is " << collisions << std::endl;
}

Upvotes: 1

muzzlator
muzzlator

Reputation: 742

The problem is you're recounting collisions (Suppose you had 4 of the same elements in your list and nothing else, and go through your algorithm to see how many collisions you'd count)

Instead, create a set of hashcodes and each time you compute a hashcode, check if it's in the set. If it's in the set, increment total number of collisions. If it's not in the set, add it to the set.

Edit:

To quickly patch your algorithm, I've done the following: incremented count after the loop and broken out of the for loop once I find a collision. This is still not super efficient since we're looping through all the results (using a set data structure would be faster) but this should at least be correct.

Also tweaked it so we don't calculate hashcode(x) over and over:

int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];

  //File stream
  while(!input.eof()){
    input>>x;
    array[count] = hashcode(x);
    for(int i = 0; i<count; i++){
        if(array[i]==array[count]){
            collisions++;
            // Once we've found one collision, we don't want to count all of them.
            break;
        }
    }
    // We don't want to check our hashcode against the value we just added
    // so we should only increment count here.
    count++;
  }
  cout<<"Total Input is " <<count-1<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}

Upvotes: 3

Related Questions