Merovingean

Reputation: 23

Merge multiple txt files into one

Good afternoon.

I ran into a problem. I can solve it, but I have little programming experience, and it seems to me that there is a more elegant and rational solution to this problem.

The problem is as follows. Given a set of text files with a total size of more than one hundred megabytes. The number of files is from 2 to N. The files contain sorted unique numbers (for example, IDs). I want to merge all the numbers into one output file. Inside the resulting file the numbers also need to be sorted.

I'm going to solve this problem as follows: open all the files. Take the first number from each file. Put them in a container (e.g., a vector). Find the smallest number in the container. Write that smallest number to the output file. In its place, put the next number from the file the smallest number was taken from. Repeat until all files are exhausted. It seems this method is called "external merge sort".

Tell me, please, is there a smarter way to solve this problem?

Upvotes: 0

Views: 2008

Answers (3)

Captain Giraffe

Reputation: 14705

External mergesort was created for exactly this problem. The complexity of this sort is O(N * log(number_of_files)), where N is the total count of numbers.

For simplicity, in your code you can store the file streams along with the numbers in a priority queue.

Completely untested, but maybe helpful code:

std::vector<std::ifstream> file_streams(stream_count);
// open streams here.
// Store a pointer rather than a reference: elements of a priority_queue
// must be assignable, and std::pair<int, std::ifstream&> is not.
using int_and_stream = std::pair<int, std::ifstream*>;
using cont = std::vector<int_and_stream>;
// pair_comparer must order pairs by the number, smallest on top
// (std::priority_queue is a max-heap by default, so invert the comparison).
std::priority_queue<int_and_stream, cont, pair_comparer> queue;

for(auto& stream : file_streams){
   int id;
   if(stream >> id)
      queue.push(std::make_pair(id, &stream));
}

while(!queue.empty()){
   auto smallest = queue.top();
   queue.pop();
   outstream << smallest.first << '\n';
   int id;
   if(*smallest.second >> id){ // more numbers left in this file?
      queue.push(std::make_pair(id, smallest.second));
   }
}

For the pair_comparer you can look here

Upvotes: 3

Serge Rogatch

Reputation: 15030

A better approach would be to store the current head of each file in a priority queue. Then in each step you take the top of the priority queue (noting the input file this item is taken from), write it to the output file, and push the next item of that input file into the priority queue.

Upvotes: 0

Valeri Atamaniouk

Reputation: 5163

Your approach is good.

But you could keep a sorted vector of number/file pairs, so you would spend less time finding the smallest number: after taking the smallest entry out, you read the next value from that file and insert it back into the vector using binary search rather than a linear scan. This is more efficient when you have a large number of input files.

But assuming that the cost of I/O is much higher than the cost of number comparisons, your approach is OK.

Upvotes: 0
