Steg Verner

Reputation: 913

Deleting duplicate lines in python

I have data of the following form:

#@ <abc>
 <http://stackoverflow.com/questions/ask> <question> _:question1 .
#@ <def>
<The> <second> <http://line> .
#@ <ghi>
 _:question1 <http#responseCode> "200"^^<http://integer> .
#@ <klm>
<The> <second> <http://line1.xml> .
#@ <nop>
 _:question1 <date> "Mon, 23 Apr 2012 13:49:27 GMT" .
#@ <jkn>
<G> <http#fifth> "200"^^<http://integer> .
#@ <k93>
 _:question1 <http#responseCode> "200"^^<http://integer> .
#@ <k22>
<This> <third> <http://line2.xml> .
#@ <k73>
 <http://site1> <hasAddress> <http://addr1> .
#@ <i27>
<kd8> <fourth> <http://addr2.xml> .

Now whenever two lines are equal, like _:question1 <http#responseCode> "200"^^<http://integer> . above (two lines are equal if they match character by character), I want to delete both equal lines along with (i) the two lines that follow each of them, i.e. the next #@ line and the data line after it (which ends with a full stop), and (ii) the line immediately preceding each equal line (which begins with #@). So the output should be:

#@ <abc>
 <http://stackoverflow.com/questions/ask> <question> _:question1 .
#@ <def>
<The> <second> <http://line> .
#@ <nop>
 _:question1 <date> "Mon, 23 Apr 2012 13:49:27 GMT" .
#@ <jkn>
<G> <http#fifth> "200"^^<http://integer> .
#@ <k73>
 <http://site1> <hasAddress> <http://addr1> .
#@ <i27>
<kd8> <fourth> <http://addr2.xml> .

Now one way to do this is to store all the lines in a set in Python and, whenever two lines are equal (i.e. they match character by character), delete the previous line and the subsequent two lines as described above. However, my dataset is 100 GB and I have 64 GB of RAM, so I cannot keep this information in a set in main memory. Is there some way I can delete the duplicate lines along with their previous and subsequent two lines in Python with limited main memory (64 GB of RAM)?
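
For concreteness, the rule above on the small example looks roughly like this when everything fits in memory (the file name data.nq is just a placeholder, and the input is assumed to strictly alternate a #@ line with a data line); the problem is that nothing like this works at 100 GB:

from collections import Counter

# Records are (#@ line, data line) pairs; a record whose data line occurs
# more than once is dropped together with the record that follows it.
with open('data.nq') as f:
    lines = f.read().splitlines()
records = [(lines[i], lines[i + 1]) for i in range(0, len(lines), 2)]

counts = Counter(data for _, data in records)

drop = set()
for i, (_, data) in enumerate(records):
    if counts[data] > 1:
        drop.update((i, i + 1))     # the duplicate's record and the next one

for i, (marker, data) in enumerate(records):
    if i not in drop:
        print(marker)
        print(data)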

Upvotes: 2

Views: 145

Answers (3)

pvg

Reputation: 2729

One fairly straightforward way: make a version of your data in which each line includes a field with its line number. Use unix sort to sort that new file, keyed on everything except the line number field. The sort utility will merge-sort the file even if it exceeds the size of available memory. You now have a new file in which duplicates sit on consecutive lines, each still carrying its original line number. Extract the line numbers of the duplicates and use them as input for a linear pass over your original data.

In more detail, the steps are:

  • Make a new version of your file such that each line is prepended by its line number. So, "someline" becomes "1, someline"
  • Sort this file using the unix sort utility: sort -t"," -k2,2 file
  • Scan the new file for consecutive duplicate entries in the second field
  • The line numbers (first field) of such entries are the line numbers of the duplicate lines in your original file. Extract these and use them as input to remove the duplicates from the original data; since you know exactly where they are, you need not read the entire file into memory or build a giant in-memory structure of duplicates. A Python sketch of this pipeline follows below.
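
A rough Python sketch of this pipeline (the file names are placeholders, and I use a tab as the delimiter instead of a comma so the separator cannot clash with commas inside the data):

import subprocess

# File names are illustrative.
ORIGINAL = 'data.nq'
NUMBERED = 'numbered.tsv'
SORTED = 'sorted.tsv'

# Step 1: prepend each line's number so its original position survives the sort.
with open(ORIGINAL) as src, open(NUMBERED, 'w') as dst:
    for lineno, line in enumerate(src):
        dst.write('%d\t%s' % (lineno, line))

# Step 2: let unix sort do the out-of-core merge sort, keyed on the line text
# (field 2 to end of line) rather than on the line number.
subprocess.check_call(['sort', '-t', '\t', '-k2', '-o', SORTED, NUMBERED])

# Step 3: in the sorted file, duplicates sit on consecutive lines; collect
# their original line numbers.
dup_linenos = set()
prev_text, prev_no = None, None
with open(SORTED) as f:
    for row in f:
        no, text = row.split('\t', 1)
        if text == prev_text:
            dup_linenos.add(int(no))
            dup_linenos.add(prev_no)   # the first occurrence is a duplicate too
        prev_text, prev_no = text, int(no)

# Step 4: dup_linenos now drives one linear pass over ORIGINAL, where the
# question's record rule (also drop the preceding '#@' line and the two
# following lines) can be applied.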

The advantage of this method over some of the others suggested is that it always works, regardless of the size of the input and the size of your available memory, and it does not fail due to hash collisions or other probabilistic artifacts. You are leveraging the merge sort in unix sort, where the hard part (dealing with larger-than-memory input) has already been done for you.

Upvotes: 1

dshepherd

Reputation: 5407

Here's an outline of how I'd do it using UNIX sort/uniq:

  1. Modify the data format so that each record is a single line. You could do this using the methods here.

  2. Sort the data with the sort command. Note that you can specify which fields matter with the --key option; you might need to exclude the #@ <abc> part by selecting all the other fields as keys (I wasn't entirely sure from your description).

  3. Apply the uniq command to the sorted output to get only the unique lines.

This should all work fine on out-of-core data as far as I know.
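
A minimal sketch of step 1, assuming the input strictly alternates a #@ line with a data line (the file names are placeholders):

# Join each '#@ <id>' line with the data line that follows it, so that
# sort/uniq see one record per line.
with open('data.nq') as src, open('records.txt', 'w') as dst:
    record = []
    for line in src:
        record.append(line.rstrip('\n'))
        if not line.startswith('#@'):      # the data line closes the record
            dst.write('\t'.join(record) + '\n')
            record = []

The joined file can then go through sort (with --key selecting the data fields) and uniq as described above.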

Upvotes: 1

fferri

Reputation: 18940

Keep a boolean hashtable of hash codes of lines already seen.

For each line:

  • if line hash()es to something you have already seen, you have a potential match: scan the file to check if it really is a duplicate.

  • if the line hash()es to a new value, just mark that hash as seen.

Dedicate as much memory as you can to this hashtable and the false positive rate will be low (i.e. you will rarely have to scan for duplicates and find none).

Example:

table_size = 2**16
seen = [False] * table_size

with open('yourfile', 'r') as infile:
    for lineno, line in enumerate(infile):
        h = hash(line) % table_size
        if seen[h]:
            # Possible duplicate: re-read the part of the file before this
            # line and check whether the exact same line already occurred.
            dup = False
            with open('yourfile', 'r') as f:
                for i, line1 in enumerate(f):
                    if i >= lineno:
                        break
                    if line == line1:
                        dup = True
                        break
            if not dup:
                print(line, end='')
        else:
            seen[h] = True
            print(line, end='')

As has been pointed out, since you cannot store all the lines in memory you don't have many options, but at least this one doesn't require scanning the file for every single line: most of the entries in the table will be False, so the algorithm is sub-quadratic as long as the table is not full; it degenerates to O(n^2) once the table is full.

You can make a very memory-efficient implementation of the hashtable that needs only 1 bit per hash code (e.g. an array of bytes, where each byte stores 8 boolean values).
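
For example, a bit-per-slot table could sit on a bytearray; the table size here is only illustrative:

# One bit per hash slot: bit (h % 8) of byte (h // 8) records whether
# slot h has been seen.
table_size = 2**30                   # 2**30 bits is about 128 MB of RAM
seen = bytearray(table_size // 8)

def test_and_set(h):
    """Return True if slot h was already marked; mark it in either case."""
    byte, bit = divmod(h % table_size, 8)
    already = bool(seen[byte] & (1 << bit))
    seen[byte] |= 1 << bit
    return already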


See also Bloom Filters for more advanced techniques.

Upvotes: 2
