Reputation: 31

Determining string uniqueness in a large file

In C, I want to process a file that contains 10⁸ 16-digit alphanumeric strings and determine if each one is unique in the file. How can I do that?

Upvotes: 3

Answers (5)

Mike

Reputation: 3460

Do a bucket sort(Hash function) into multiple files, one file for each bucket. Then process each bucket's file to determine if all strings are unique within the bucket.

Upvotes: 1

bta

Reputation: 45077

As other people have said, the most straightforward method is to simply load the entire file and use something like qsort to sort it.

If you can't load that much into memory at once, another option is to load the data in several passes. On your first pass, read the file and only load in lines that start with A. Sort those and find the unique lines. For the next pass, load all the lines that start with B, sort, and find unique lines. Repeat this process for every alphanumeric character that a line might start with. Using this technique, you should only have to load a fraction of the file into memory at a time and it shouldn't cause you to mis-classify any lines.

Upvotes: 2

user411313

Reputation: 3990

Take a library with set/map functions, e.g. see link text

Upvotes: 0

Codo

Reputation: 78915

You'll need to sort the file.

Just load it into a single memory block, run qsort from the C runtime library on the memory block and the finally run sequentially over all strings to check for two consecutive strings that are the same.

Upvotes: 0

Jerry Coffin

Reputation: 490408

Given that you're talking about ~16 megabytes of data, the obvious way to do it would be to just load the data into a hash table (or something on that order) and count the occurrences of each string.

I can't quite imagine doing this in C though -- most other languages will supply a reasonable data structure (some sort of map), making the job substantially easier.

Upvotes: 1

Determining string uniqueness in a large file

Answers (5)

Related Questions