Piyush Kansal

Reputation: 1241

How to compare all the lines in a sorted file (file size > 1GB) in a very efficient manner

Let's say the input file is:

Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ

I have to create the following output:

Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1

The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.

How can I get the required output in the minimum possible time? My current implementation uses a C++ program that reads a line from the file and compares it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.
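Roughly, the approach looks like this (a simplified sketch, not my exact code; the file name is a placeholder and error handling is omitted):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream in("input.txt");  // placeholder file name
        std::string prev, line;
        long long count = 0;
        while (std::getline(in, line)) {
            if (count > 0 && line == prev) {
                ++count;                // same as the previous line: extend the run
            } else {
                if (count > 0)          // emit the finished run
                    std::cout << prev << ' ' << count << '\n';
                prev = line;
                count = 1;
            }
        }
        if (count > 0)                  // flush the last run
            std::cout << prev << ' ' << count << '\n';
    }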

To improve the running time, the next option is to use mmap. But before implementing it, I just wanted to confirm: is there a faster way to do it, using any other language or scripting?

Upvotes: 3

Views: 211

Answers (3)

cli_hlt

Reputation: 7164

Well, there are two time scales you are comparing which aren't really related to each other. The first is algorithmic complexity, which you are expressing in O notation. This, however, has nothing to do with the complexity of reading from a file.

Say in the ideal case you have all your data in memory and you have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.), finding duplicates could cost O(n^2), O(n), or even O(1) per lookup if you have a perfect hash (just for detecting the item).
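For example, a quick sketch of the hash-map case in C++ (illustrative only; it assumes all distinct lines fit in memory, and note that the output order is not preserved):

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::ifstream in("input.txt");              // placeholder file name
        std::unordered_map<std::string, long long> counts;
        std::string line;
        while (std::getline(in, line))
            ++counts[line];                         // expected O(1) per lookup
        for (const auto& kv : counts)               // iteration order is unspecified
            std::cout << kv.first << ' ' << kv.second << '\n';
    }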

Reading from a file or mapping it to memory has no relation to the "big-O" notation at all, so you don't consider it in complexity calculations. You will just pick the one that takes less measured time - nothing more.

Upvotes: 0

hobbs

Reputation: 240334

uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'

The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.

As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.
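If you do want to try the mmap route, a rough sketch might look like this (POSIX-only, error handling elided, illustrative rather than benchmarked; the file name is a placeholder):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        int fd = open("input.txt", O_RDONLY);      // placeholder file name
        struct stat st;
        fstat(fd, &st);
        const char* base = static_cast<const char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

        const char* prev = nullptr;                // start of the previous line
        size_t prev_len = 0;
        long long count = 0;
        const char* p = base;
        const char* end = base + st.st_size;

        while (p < end) {
            const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
            size_t len = nl ? (size_t)(nl - p) : (size_t)(end - p);
            if (count > 0 && len == prev_len && memcmp(p, prev, len) == 0) {
                ++count;                           // same line as before
            } else {
                if (count > 0)                     // emit the finished run
                    printf("%.*s %lld\n", (int)prev_len, prev, count);
                prev = p;
                prev_len = len;
                count = 1;
            }
            p = nl ? nl + 1 : end;
        }
        if (count > 0)                             // flush the last run
            printf("%.*s %lld\n", (int)prev_len, prev, count);

        munmap(const_cast<char*>(base), st.st_size);
        close(fd);
    }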

The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
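Something in that style, stripped down (fixed-size buffers, so it assumes no line exceeds 4096 bytes; reads from stdin, so run it as ./a.out < input.txt):

    #include <cstdio>
    #include <cstring>

    int main() {
        char prev[4096] = "", line[4096];       // assumes lines shorter than 4096 bytes
        long long count = 0;
        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\n")] = '\0';   // strip the trailing newline
            if (count > 0 && strcmp(line, prev) == 0) {
                ++count;                        // repeat of the previous line
            } else {
                if (count > 0)                  // emit the finished run
                    printf("%s %lld\n", prev, count);
                strcpy(prev, line);
                count = 1;
            }
        }
        if (count > 0)                          // flush the last run
            printf("%s %lld\n", prev, count);
    }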

Upvotes: 2

Daniel Placek

Reputation: 765

In most cases this operation will be completely I/O bound (especially with well-written C++).

Given that, it's likely the only bottleneck you need to care about is the disk.

I think you will find this to be relevant: mmap() vs. reading blocks

Ben Collins has a very good answer comparing mmap to standard read/write.

Upvotes: 1
