Reputation: 1241
Let's say the input file is:
Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ
I have to create the following output:
Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1
The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.
How can I get the required output in the minimum possible time? My current implementation uses a C++ program that reads a line from the file and compares it with the next line. This will always run in O(n) time, where n is the number of characters in the file.
To improve the running time, the next option is to use mmap. But before implementing it, I wanted to confirm: is there a faster way to do it, using any other language or scripting?
Upvotes: 3
Views: 211
Reputation: 7164
Well, there are two time scales you are comparing that aren't really related to each other. The first is algorithmic complexity, which you are expressing in O notation. This has, however, nothing to do with the cost of reading from a file.
Say in the ideal case you have all your data in memory and you have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.), finding duplicates could take O(n^2), O(n), or even O(1) per lookup if you have a perfect hash (just for detecting the item).
Reading from a file or mapping it to memory has no relation to the "big-Oh" notation, so it plays no part in the complexity calculation. You will just pick the option that takes less measured time - nothing more.
Upvotes: 0
Reputation: 240334
uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'
The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.
As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.
The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
Upvotes: 2
Reputation: 765
In most cases this operation will be completely I/O bound (especially with well-written C++).
Given that, it's likely the only bottleneck you need to care about is the disk.
I think you will find this to be relevant: mmap() vs. reading blocks
Ben Collins has a very good answer comparing mmap to standard read/write.
Upvotes: 1