Reputation: 913
I have data of the following form:
#@ <abc>
<http://stackoverflow.com/questions/ask> <question> _:question1 .
#@ <def>
<The> <second> <http://line> .
#@ <ghi>
_:question1 <http#responseCode> "200"^^<http://integer> .
#@ <klm>
<The> <second> <http://line1.xml> .
#@ <nop>
_:question1 <date> "Mon, 23 Apr 2012 13:49:27 GMT" .
#@ <jkn>
<G> <http#fifth> "200"^^<http://integer> .
#@ <k93>
_:question1 <http#responseCode> "200"^^<http://integer> .
#@ <k22>
<This> <third> <http://line2.xml> .
#@ <k73>
<http://site1> <hasAddress> <http://addr1> .
#@ <i27>
<kd8> <fourth> <http://addr2.xml> .
Now whenever two lines are equal, like _:question1 <http#responseCode> "200"^^<http://integer> . above (lines which match each other character by character are equal lines), I want to delete both equal lines, along with (i) the line previous to each equal line (which begins with #@) and (ii) the two lines subsequent to it (the next #@ line and the statement which ends with a fullstop), so that the data becomes:
#@ <abc>
<http://stackoverflow.com/questions/ask> <question> _:question1 .
#@ <def>
<The> <second> <http://line> .
#@ <nop>
_:question1 <date> "Mon, 23 Apr 2012 13:49:27 GMT" .
#@ <jkn>
<G> <http#fifth> "200"^^<http://integer> .
#@ <k73>
<http://site1> <hasAddress> <http://addr1> .
#@ <i27>
<kd8> <fourth> <http://addr2.xml> .
Now, one way to do this is to store all these lines in a set in Python and, whenever two lines are equal (i.e. they match character by character), delete the previous and subsequent two lines. However, the size of my dataset is 100 GB (and I have 64 GB of RAM), so I cannot keep this information in a set in main memory. Is there some way I can delete the duplicate lines along with their previous and subsequent two lines in Python with limited main memory (64 GB of RAM)?
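For reference, a minimal in-memory sketch of that approach (using a Counter rather than a plain set so that both copies of a repeated statement can be found; file names are illustrative) would be:

from collections import Counter

# Read the whole file into memory: only viable when the data fits in RAM.
with open('data.nq') as f:
    lines = f.readlines()

# Count every statement line (the lines that do not start with '#@').
counts = Counter(l for l in lines if not l.startswith('#@'))

# For each repeated statement at index i, drop lines i-1, i, i+1 and i+2:
# the preceding '#@' line, the statement itself, and the following pair.
skip = set()
for i, l in enumerate(lines):
    if not l.startswith('#@') and counts[l] > 1:
        skip.update((i - 1, i, i + 1, i + 2))

with open('deduped.nq', 'w') as out:
    out.writelines(l for i, l in enumerate(lines) if i not in skip)

This is exactly what does not scale to 100 GB, which is what the question is about.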
Upvotes: 2
Views: 145
Reputation: 2729
One fairly straightforward way: make a version of your data such that each line includes a field with its line number. Use unix 'sort' to sort that new file, excluding the line-number field from the key. The sort utility will merge-sort the file even if it exceeds the size of available memory. Now you have a new file in which the duplicates sit next to each other, along with their original line numbers. Extract the line numbers of the duplicates and then use that as input for linearly processing your original data.
In more detailed steps: number the lines, sort on everything except the line number, extract the line numbers of the duplicated records, and then use those numbers while linearly processing your original data (a sketch follows below).
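A minimal Python sketch of that pipeline, under a few assumptions that are not part of the answer: the input lives in a file named data.nq, statement lines are tagged with a tab-separated line number, unix sort is driven through subprocess with LC_ALL=C so the comparison is byte-wise, and the set of duplicate line numbers is small enough to hold in memory.

import os
import subprocess

SRC = 'data.nq'            # hypothetical input file
NUMBERED = 'numbered.txt'  # statement lines tagged with their line number
SORTED = 'sorted.txt'

# Pass 1: write every statement line (the lines not starting with '#@')
# prefixed by its 1-based line number, tab-separated.
with open(SRC) as src, open(NUMBERED, 'w') as out:
    for lineno, line in enumerate(src, start=1):
        if not line.startswith('#@'):
            out.write('%d\t%s' % (lineno, line))

# Let unix sort do the external merge sort, keyed on the statement text
# (field 2 onwards), so equal statements end up adjacent.
env = dict(os.environ, LC_ALL='C')  # plain byte-wise ordering
subprocess.check_call(
    ['sort', '-t', '\t', '-k', '2', '-o', SORTED, NUMBERED], env=env)

# Pass 2: walk the sorted file and collect the line numbers of every
# statement whose text occurs more than once.
doomed = set()             # line numbers of duplicated statements
prev_text, prev_nos = None, []
with open(SORTED) as f:
    for row in f:
        no, text = row.split('\t', 1)
        if text == prev_text:
            prev_nos.append(int(no))
        else:
            if len(prev_nos) > 1:
                doomed.update(prev_nos)
            prev_text, prev_nos = text, [int(no)]
    if len(prev_nos) > 1:
        doomed.update(prev_nos)

# Pass 3: stream the original file and drop, for each duplicate at line n,
# lines n-1, n, n+1 and n+2 (the preceding '#@' line and the following pair).
skip = set()
for n in doomed:
    skip.update((n - 1, n, n + 1, n + 2))
with open(SRC) as src, open('deduped.nq', 'w') as out:
    for lineno, line in enumerate(src, start=1):
        if lineno not in skip:
            out.write(line)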
The advantage of this method compared to some of the others suggested is that it always works, regardless of the size of the input and the size of your available memory, and it does not fail due to hash collisions or other probabilistic artifacts. You are leveraging the merge sort in unix sort, where the hard part (dealing with larger-than-memory input) has been done for you.
Upvotes: 1
Reputation: 5407
Here's an outline of how I'd do it using UNIX sort/uniq:
1. Modify the data format so that each record is a single line. You could do this using the methods here (a short sketch of this step is given after the list).
2. Sort the data with the sort command. Note that you can specify which fields are important with the --key option; you might need to exclude the #@ <abc> part by selecting all the other fields as keys (I wasn't entirely sure from your description).
3. Apply the uniq command to the sorted output to get only the unique lines.
This should all work fine on out-of-core data as far as I know.
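For step 1, a minimal sketch under the assumption that every record is exactly a #@ line followed by one statement line (file names are illustrative); putting the statement first makes it the natural sort key:

# Hypothetical step 1: collapse each two-line record onto a single
# tab-separated line, statement first, so sort/uniq can treat a record
# as one unit and key on the statement text.
with open('data.nq') as src, open('flat.txt', 'w') as out:
    for tag in src:                  # the '#@ <...>' line
        stmt = next(src)             # assumes well-formed '#@'/statement pairs
        out.write(stmt.rstrip('\n') + '\t' + tag)

Sorting flat.txt with the tab as the field separator and the first field as the key then puts records with identical statements next to each other; splitting each kept line on the tab restores the original two-line layout.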
Upvotes: 1
Reputation: 18940
Keep a boolean hashtable of hash codes of lines already seen.
For each line:
if line hash()es to something you have already seen, you have a potential match: scan the file to check if it really is a duplicate.
if line hash()es to a new hash, just mark the hash for the first time.
Dedicate as much memory as you can to this hashtable, and the false-positive rate will be low (i.e. you will less often have to scan for duplicates and find none).
Example:
table_size = 2**16
seen = [False]*table_size

with open('yourfile', 'r') as infile:
    for lineno, line in enumerate(infile):
        h = hash(line) % table_size
        if seen[h]:
            # Potential duplicate (or just a hash collision): rescan the
            # part of the file *before* this line for the same text.
            dup = False
            with open('yourfile', 'r') as f:
                for earlier, line1 in enumerate(f):
                    if earlier >= lineno:
                        break
                    if line == line1:
                        dup = True
                        break
            if not dup:
                print(line, end='')   # line already ends with '\n'
        else:
            seen[h] = True
            print(line, end='')
As has been pointed out, since you cannot store all the lines in memory you don't have many options, but at least this one doesn't require scanning the file for every single line, because most of the entries in the table will be False; i.e. the algorithm is sub-quadratic as long as the table is not full, and it degenerates to O(n²) once the table is full.
You can make a very memory-efficient implementation of the hash table that requires only 1 bit for each hash code (e.g. make it an array of bytes, where each byte stores 8 boolean values).
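A small sketch of that bit-per-slot variant (the helper name and the table size are illustrative):

# Bit-array version of 'seen': one bit per hash slot instead of one
# Python bool, stored in a plain bytearray.
table_size = 2 ** 30                 # 2**30 slots = 128 MB of bits
seen_bits = bytearray(table_size // 8)

def test_and_set(h):
    """Return whether bit h was already set, then set it."""
    byte, mask = h // 8, 1 << (h % 8)
    already = bool(seen_bits[byte] & mask)
    seen_bits[byte] |= mask
    return already

# Usage inside the loop above: replace the seen[h] check with
#     if test_and_set(hash(line) % table_size): ...potential duplicate...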
See also Bloom Filters for more advanced techniques.
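For completeness, a toy Bloom filter along those lines (the bit count M, the number of hashes K, and the hashing scheme are illustrative, not tuned):

import hashlib

M = 2 ** 27                          # number of bits (16 MB)
K = 4                                # hash functions per item
bloom = bytearray(M // 8)

def _hashes(line):
    for i in range(K):
        digest = hashlib.blake2b((str(i) + line).encode()).digest()
        yield int.from_bytes(digest[:8], 'big') % M

def bloom_add(line):
    for h in _hashes(line):
        bloom[h // 8] |= 1 << (h % 8)

def bloom_might_contain(line):
    # False positives are possible, false negatives are not.
    return all(bloom[h // 8] & (1 << (h % 8)) for h in _hashes(line))

The flow stays the same as with the plain table: if bloom_might_contain(line) you rescan to confirm a real duplicate, otherwise bloom_add(line) and keep the line.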
Upvotes: 2