codinggoose

Reputation: 133

Memory efficient way to remove duplicate lines in a text file using C++

What is the most memory efficient way to remove duplicate lines in a large text file using C++?

Let me clarify: I'm not asking for code, just the best method. The duplicate lines are not guaranteed to be adjacent. I realize that an approach optimized for minimal memory usage would result in slower speeds, but that is my restriction, as the files are far too large.

Upvotes: 13

Views: 10111

Answers (5)

Permaquid

Reputation: 1978

Why not just consult Knuth, Sorting and Searching? That will give you a great background for making a balanced decision.

Upvotes: 1

santosc

Reputation: 1273

Simple brute-force solution (very little memory consumption): do an O(n^2) pass through the file, comparing each line against every other and removing duplicates. Speed: O(n^2), memory: constant.

Fast (but poor memory consumption): Stefan Kendall's solution: hash each line, store the hashes in a map of some sort, and remove any line whose hash already exists. Speed: O(n), memory: O(n).

If you're willing to sacrifice file order (I assume not, but I'll add it): you can sort the lines, then pass through removing duplicates. Speed: O(n log n), memory: constant.

edit: If you don't like the idea of sorting the file contents or trying to maintain unique hashes but can handle O(n) memory usage: you can identify each line with its 32-bit or 64-bit position marker (depending on the file's size) and sort the file positions instead of the file contents (a rough sketch of this follows below).

edit #2: caveat: sorting lines of different lengths in memory is harder than sorting, say, an array of ints... actually, thinking about how the memory would have to shift and move in a merge step, I'm second-guessing my ability to sort a file like that in O(n log n).
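
A rough sketch of that offset-sorting idea, under the assumption that output order may change (as in the sort-based option above); the filenames are placeholders. One std::streamoff per line is kept in memory, and every comparison re-reads both lines from disk, which sidesteps the in-memory shuffling concern from edit #2 at the cost of a lot of I/O:

    #include <algorithm>
    #include <fstream>
    #include <string>
    #include <vector>

    // Read the line that starts at the given file offset.
    static std::string line_at(std::ifstream& in, std::streamoff off) {
        in.clear();
        in.seekg(off);
        std::string s;
        std::getline(in, s);
        return s;
    }

    int main() {
        std::ifstream in("input.txt");            // placeholder input file
        std::vector<std::streamoff> offsets;

        // First pass: record the starting offset of every line.
        std::string line;
        std::streamoff pos = in.tellg();
        while (std::getline(in, line)) {
            offsets.push_back(pos);
            pos = in.tellg();
        }

        // Sort the offsets by the text of the lines they point at.
        std::sort(offsets.begin(), offsets.end(),
                  [&](std::streamoff a, std::streamoff b) {
                      return line_at(in, a) < line_at(in, b);
                  });

        // Second pass: write each line once, skipping now-adjacent duplicates.
        std::ofstream out("deduped.txt");         // placeholder output file
        std::string prev;
        bool first = true;
        for (std::streamoff off : offsets) {
            std::string cur = line_at(in, off);
            if (first || cur != prev) {
                out << cur << '\n';
                prev = cur;
                first = false;
            }
        }
    }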

Upvotes: 2

user262976

Reputation: 2074

I would hash each line and then seek back to lines that have non-unique hashes and compare them individually (or in a buffered fashion). This would work well on files with a relatively low occurrence of duplicates.

When you use a hash, you can cap the memory used by the table at a constant amount (i.e., you could have a tiny hash table with just 256 slots, or something larger; in any case, the amount of memory can be restricted to any constant amount). The values in the table are the offsets of the lines with that hash, so you only need line_count*sizeof(int) plus a constant to maintain the hash table.

Even simpler (but much slower) would be to scan the entire file for each line, but I prefer the first option. This is the most memory-efficient option possible; you would only need to store two offsets and two bytes to do the comparison.
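
A minimal sketch of the first option as I understand it, with a 256-slot table and placeholder filenames (both assumptions): each bucket stores the file offsets of the unique lines that hashed there, and a collision is resolved by seeking back and comparing the actual text:

    #include <cstddef>
    #include <fstream>
    #include <functional>
    #include <string>
    #include <vector>

    // Read the line that starts at the given file offset.
    static std::string line_at(std::ifstream& in, std::streamoff off) {
        in.clear();
        in.seekg(off);
        std::string s;
        std::getline(in, s);
        return s;
    }

    int main() {
        const std::size_t kBuckets = 256;          // assumed small, fixed table size
        std::vector<std::vector<std::streamoff>> table(kBuckets);

        std::ifstream in("input.txt");             // placeholder input
        std::ifstream reread("input.txt");         // second handle for seeking back
        std::ofstream out("deduped.txt");          // placeholder output

        std::string line;
        std::streamoff pos = in.tellg();
        while (std::getline(in, line)) {
            std::size_t bucket = std::hash<std::string>{}(line) % kBuckets;

            // Compare against every earlier line that landed in this bucket.
            bool duplicate = false;
            for (std::streamoff off : table[bucket]) {
                if (line_at(reread, off) == line) { duplicate = true; break; }
            }

            if (!duplicate) {
                out << line << '\n';
                table[bucket].push_back(pos);      // remember where this line lives
            }
            pos = in.tellg();
        }
    }

This keeps the original file order and stores roughly one offset per unique line plus the constant-size table.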

Upvotes: 6

Stefan Kendall

Reputation: 67862

To minimize memory usage:

If you have unlimited (or very fast) disk I/O, you could write each line to its own file, with the filename being the hash plus some identifier indicating order (or no identifier, if order is irrelevant). In this manner, you use the file system as extended memory. This should be much faster than re-scanning the whole file for each line.

As an addition to what others have said below: if you expect a high duplicate rate, you could also keep some number of the hashes in memory as well as in files, which would give much better results for high duplicate rates. Since the file is so large, I doubt n^2 is acceptable in processing time. My solution is O(n) in processing speed and O(1) in memory; it is O(n) in disk space used during runtime, however, which the other solutions are not.
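
A possible sketch of the file-system-as-memory idea, simplified so that all lines sharing a hash go into one bucket file named after the hash (directory and file names are placeholders; C++17's <filesystem> is assumed). A line is written to the output only if its bucket file does not already contain it:

    #include <filesystem>
    #include <fstream>
    #include <functional>
    #include <string>

    int main() {
        namespace fs = std::filesystem;
        fs::create_directory("buckets");                  // scratch space on disk

        std::ifstream in("input.txt");                    // placeholder input
        std::ofstream out("deduped.txt");                 // placeholder output

        std::string line;
        while (std::getline(in, line)) {
            std::size_t h = std::hash<std::string>{}(line);
            fs::path bucket = fs::path("buckets") / std::to_string(h);

            // Scan this hash's bucket file (if it exists) for an exact match.
            bool duplicate = false;
            {
                std::ifstream b(bucket);
                std::string seen;
                while (std::getline(b, seen)) {
                    if (seen == line) { duplicate = true; break; }
                }
            }

            if (!duplicate) {
                out << line << '\n';
                std::ofstream(bucket, std::ios::app) << line << '\n';
            }
        }
    }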

It sounds like you might be running on limited hardware of varied specs, so you'll want to test a number of duplicate-removal algorithms and profile them before deciding which is best for long-term implementation.

Upvotes: 3

Anatoly Fayngelerin

Reputation: 676

You could use an I/O-efficient sort (like the unix sort command) and then read the file back in line by line, comparing each line to the one previously read. If the two are equal, don't output anything; if they aren't, output the line.

This way the amount of memory used by the algorithm is constant.
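
A short sketch of the second half of this, assuming the file has already been sorted on disk (e.g. by running the unix sort command first; filenames are placeholders). Only the previously read line is kept in memory:

    #include <fstream>
    #include <string>

    int main() {
        std::ifstream in("sorted.txt");    // assumed already sorted externally
        std::ofstream out("deduped.txt");  // placeholder output

        std::string prev, line;
        bool first = true;
        while (std::getline(in, line)) {
            if (first || line != prev) {   // emit only when the line changes
                out << line << '\n';
                prev = line;
                first = false;
            }
        }
    }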

Upvotes: 2
