Reputation: 3413
I'm using a naive approach to this problem: I'm putting the words in a linked list and doing a linear search through it. But it's taking too much time on large files.
I was thinking of using a Binary Search Tree, but I don't know if it works well with strings. I've also heard of Skip Lists, but haven't really learned them yet.
And also I have to use the C language...
Upvotes: 1
Views: 1126
Reputation: 2907
The first upgrade to your algorithm could be keeping the list sorted, so your linear search can stop early (you only search until you find an element greater than yours), but this is still a naive solution.
The best approaches are Binary Search Trees and, even better, a prefix tree (or trie, already mentioned in another answer).
In "The C Programming Language" From K&R you have the exact example of what you are looking for. The first example of "autoreferenced data structs" (6.5) is a binary search tree used for counting the ocurrences of every word in a string. (You don't need to count :P)
the structure is something like this:
struct tnode {
    char *word;            /* the word itself */
    struct tnode *left;    /* words that compare less */
    struct tnode *right;   /* words that compare greater */
};
In the book you can see the whole example of what you want to do.
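Not the book's exact code, but a minimal insertion sketch in the spirit of K&R's addtree(), using the struct above (the function name addword and the added flag are my own; error handling is omitted for brevity):

    #include <stdlib.h>
    #include <string.h>

    /* Insert word into the tree rooted at p; returns the (possibly new) root.
       Sets *added to 1 if the word was not already present, 0 otherwise. */
    struct tnode *addword(struct tnode *p, const char *word, int *added)
    {
        int cond;
        if (p == NULL) {                       /* word seen for the first time */
            p = malloc(sizeof(struct tnode));  /* error handling omitted */
            p->word = strdup(word);            /* strdup() is POSIX */
            p->left = p->right = NULL;
            *added = 1;
        } else if ((cond = strcmp(word, p->word)) == 0) {
            *added = 0;                        /* duplicate */
        } else if (cond < 0) {
            p->left = addword(p->left, word, added);
        } else {
            p->right = addword(p->right, word, added);
        }
        return p;
    }

Call it for every word and sum the added flags to get the number of distinct words.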
Binary Search Trees work well with any type of data that can be ordered, and will beat a linear search in a list.
Sorry for my poor English, and correct me if anything I've said is wrong; I'm very much a noob with C :p
EDIT: I can't add comments to other answers, but I've read a comment from the OP saying "The list isn't sorted so I can't use binary search". It makes no sense to use binary search on a linked list. Why? Binary search is efficient when access to a random element is fast, like in an array. In a doubly linked list, your worst-case access will be n/2. You could add extra pointers into the list (shortcuts to key elements), but that is a bad solution... (done systematically, that is essentially the skip list you mentioned).
Upvotes: 1
Reputation: 7752
You're counting the number of unique words in the file?
Why don't you construct a simple hash table? That way, for each word in your list, you add it into the hash table. Any duplicate is discarded, since it is already in the table; finally, you can just count the number of elements in the data structure (by storing a counter and incrementing it each time you add to the table).
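A minimal sketch of that idea, assuming words are whitespace-delimited tokens read with scanf(); the fixed bucket count and djb2-style hash are my own choices, not anything the OP specified:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 65536               /* tune to the expected number of words */

    struct node {
        char *word;
        struct node *next;
    };

    static struct node *bucket[NBUCKETS];

    /* djb2-style string hash, reduced to a bucket index */
    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Returns 1 if word was newly added, 0 if it was already present. */
    static int add(const char *word)
    {
        unsigned long h = hash(word);
        for (struct node *n = bucket[h]; n; n = n->next)
            if (strcmp(n->word, word) == 0)
                return 0;
        struct node *n = malloc(sizeof *n);  /* error handling omitted */
        n->word = strdup(word);
        n->next = bucket[h];
        bucket[h] = n;
        return 1;
    }

    int main(void)
    {
        char buf[256];
        size_t unique = 0;
        while (scanf("%255s", buf) == 1)
            unique += add(buf);
        printf("%zu unique words\n", unique);
        return 0;
    }

With a decent hash and enough buckets, each lookup is expected O(1), so the whole file is processed in roughly linear time.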
Upvotes: 3
Reputation: 17430
If you need something simple and readily available, then man tsearch gives you a simple binary search tree. But note that this is a plain binary search tree, not a balanced one.
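A minimal sketch with tsearch()/tfind() from <search.h> (POSIX); the scanf() loop is just one way to tokenize the input:

    #include <search.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp(const void *a, const void *b)
    {
        return strcmp(a, b);
    }

    int main(void)
    {
        void *root = NULL;
        size_t unique = 0;
        char buf[256];

        while (scanf("%255s", buf) == 1) {
            if (tfind(buf, &root, cmp) == NULL) {  /* not in the tree yet */
                char *copy = strdup(buf);          /* the tree stores pointers */
                if (copy == NULL || tsearch(copy, &root, cmp) == NULL)
                    return 1;                      /* out of memory */
                unique++;
            }
        }
        printf("%zu unique words\n", unique);
        return 0;
    }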
Depending on the number of unique words, a plain C array + realloc() + qsort() + bsearch() might be an option too. That's what I use when I need a no-frills, faster-than-linear search in plain portable C. (Otherwise, if possible, I opt for C++ and std::map/std::set.)
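A variation on that, sketched under the assumption that you only need the final count, so you can collect everything first and qsort() once at the end instead of keeping the array sorted for bsearch() as you go:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmpstr(const void *a, const void *b)
    {
        return strcmp(*(char *const *)a, *(char *const *)b);
    }

    int main(void)
    {
        char **words = NULL;
        size_t n = 0, cap = 0;
        char buf[256];

        /* Collect every word into a growable array. */
        while (scanf("%255s", buf) == 1) {
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                words = realloc(words, cap * sizeof *words); /* error handling omitted */
            }
            words[n++] = strdup(buf);
        }

        /* Sort, then count runs of identical adjacent words. */
        qsort(words, n, sizeof *words, cmpstr);
        size_t unique = 0;
        for (size_t i = 0; i < n; i++)
            if (i == 0 || strcmp(words[i], words[i - 1]) != 0)
                unique++;
        printf("%zu unique words\n", unique);
        return 0;
    }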
More advanced options are often platform-specific (e.g. glib on Linux).
P.S. Another structure that is very easy to implement is a hash table. It's less efficient for strings, but can very quickly be made blazing fast by throwing memory at the problem.
Upvotes: 1
Reputation: 7832
If you're on a UNIX system, then you could use the bsearch() or hsearch() family of functions instead of a linear search. (bsearch() needs a sorted array; hsearch() is a POSIX hash table.)
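For hsearch(), a minimal sketch; note that the POSIX table cannot grow after hcreate(), so the capacity below is an assumed upper bound on the number of distinct words:

    #include <search.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char buf[256];
        size_t unique = 0;

        if (hcreate(50000) == 0)        /* assumed capacity; must exceed word count */
            return 1;
        while (scanf("%255s", buf) == 1) {
            ENTRY e = { .key = buf, .data = NULL };
            if (hsearch(e, FIND) == NULL) {    /* first occurrence of this word */
                e.key = strdup(buf);           /* the table keeps the pointer */
                if (e.key == NULL || hsearch(e, ENTER) == NULL)
                    return 1;                  /* table full or out of memory */
                unique++;
            }
        }
        printf("%zu unique words\n", unique);
        hdestroy();
        return 0;
    }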
Upvotes: 1
Reputation: 68016
I'm putting the words in a linked list and just making a linear search into it.
If, to check whether word W is present, you go through the whole list, then of course it's slow: that's O(n^2) overall, where n is the size of the list.
The simplest way is probably a hash table. It's easy to implement yourself (unlike some tree structures), and there are C libraries for it as well. You'll get O(n) expected complexity.
Edit: Some C hashtable implementations:
http://en.wikipedia.org/wiki/Hash_table#Independent_packages
Upvotes: 1
Reputation: 239181
Binary Search Trees work fine for strings.
If you don't care about having the words in sorted order, you can just use a hash table.
Upvotes: 4
Reputation: 355207
You can put all of the words into a trie and then count the number of distinct words it contains after you have processed the whole file.
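A minimal sketch of that, assuming lowercase ASCII words (other characters are simply skipped here for brevity; a real trie would widen the alphabet):

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ALPHABET 26

    /* Trie node: one child per letter, plus an end-of-word flag. */
    struct trie {
        struct trie *child[ALPHABET];
        int is_word;
    };

    static struct trie *new_node(void)
    {
        return calloc(1, sizeof(struct trie)); /* error handling omitted */
    }

    /* Insert a word; returns 1 if it was not in the trie before. */
    static int insert(struct trie *t, const char *w)
    {
        for (; *w; w++) {
            int i = tolower((unsigned char)*w) - 'a';
            if (i < 0 || i >= ALPHABET)
                continue;                      /* skip non-letters */
            if (t->child[i] == NULL)
                t->child[i] = new_node();
            t = t->child[i];
        }
        if (t->is_word)
            return 0;
        t->is_word = 1;
        return 1;
    }

    int main(void)
    {
        struct trie *root = new_node();
        char buf[256];
        size_t unique = 0;

        while (scanf("%255s", buf) == 1)
            unique += insert(root, buf);
        printf("%zu unique words\n", unique);
        return 0;
    }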
Upvotes: 5