Reputation: 984
I have crawled txt files from different websites, and now I need to glue them into one file. Many lines from the various websites are similar to each other, and I want to remove the repetitions. Here is what I have tried:
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()
for sourceline in sourcelines:
    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()
    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print destline
            print sourceline
            similar = True
    if not similar:
        destfile.write(sourceline)
    destfile.close()
I will run it for every source and write line by line to the same file. The result is that even if I run it for the same file multiple times, the lines are always appended to the destination file.
EDIT: I have tried the code from the answer. It's still very slow. Even if I minimize the IO, the comparison is still O(n^2), especially with 1000+ lines. I have on average 10,000 lines per file.
Any other ways to remove the duplicates?
Upvotes: 1
Views: 154
Reputation: 25039
Here is a short version that does minimal IO and cleans up after itself.
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
with open('%s.txt' % destname, 'a+') as destfile:
    # we read in the file so that on subsequent runs of this script, we
    # won't duplicate the lines. ('a+' keeps the existing contents; seek
    # back to the start so readlines() actually sees them.)
    destfile.seek(0)
    known_lines = set(destfile.readlines())

    with open('%s.txt' % sourcename) as sourcefile:
        for line in sourcefile:
            similar = False
            for known in known_lines:
                ratio = difflib.SequenceMatcher(None, line, known).ratio()
                if ratio > 0.8:
                    print ratio
                    print line
                    print known
                    similar = True
                    break
            if not similar:
                destfile.write(line)
                known_lines.add(line)
Instead of reading the known lines from the file on every iteration, we save them to a set and compare against that. The set is essentially a mirror of the contents of 'destfile'.
By its very nature, this problem has O(n^2) complexity. Because you're looking for similarity with known strings, rather than identical strings, you have to compare against every previously seen string. If you were looking to remove exact duplicates rather than fuzzy matches, you could use a simple set lookup, with O(1) complexity per line, making your entire solution O(n).
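For reference, here is a minimal sketch of that exact-duplicate case (assuming the same sourcename and destname as in the snippet above), where a set membership test replaces the inner loop entirely:

with open('%s.txt' % destname, 'a+') as destfile:
    destfile.seek(0)                      # read the existing contents first
    seen = set(destfile.readlines())
    with open('%s.txt' % sourcename) as sourcefile:
        for line in sourcefile:
            if line not in seen:          # O(1) membership test per line
                destfile.write(line)
                seen.add(line)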
There might be a way to reduce the fundamental complexity by using lossy compression on the strings so that two similar strings compress to the same result. This is, however, both out of scope for a Stack Overflow answer and beyond my expertise. It is an active research area, so you might have some luck digging through the literature.
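As a very rough illustration of that idea (not one of the research-grade techniques; bucket_key below is a made-up fingerprint), you could group the known lines by a crude lossy key and only run ratio() within a bucket, at the cost of missing similar pairs whose keys differ. This assumes known_lines and difflib from the snippet above:

from collections import defaultdict

def bucket_key(line):
    # Hypothetical lossy fingerprint: lowercase the line and keep only
    # its first three words. Similar lines will often share a key.
    return ' '.join(line.lower().split()[:3])

buckets = defaultdict(list)
for known in known_lines:
    buckets[bucket_key(known)].append(known)

def is_similar(line):
    # Only compare against lines that share the same fingerprint.
    for known in buckets[bucket_key(line)]:
        if difflib.SequenceMatcher(None, line, known).ratio() > 0.8:
            return True
    return False

In the main loop you would call is_similar(line) instead of iterating over all of known_lines, and append each newly written line to its bucket as well.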
You could also reduce the time taken by ratio() by using the less accurate alternatives quick_ratio() and real_quick_ratio().
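For instance, because both are documented upper bounds on ratio(), they can serve as cheap pre-filters so the expensive call only runs on likely matches; a sketch (the similar_enough helper is just for illustration):

def similar_enough(a, b, threshold=0.8):
    sm = difflib.SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() never return less than ratio(),
    # so if either is already at or below the threshold, ratio() cannot
    # exceed it and we can skip the expensive computation.
    if sm.real_quick_ratio() <= threshold:   # uses only the lengths
        return False
    if sm.quick_ratio() <= threshold:        # uses character counts only
        return False
    return sm.ratio() > threshold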
Upvotes: 2
Reputation: 13699
Basically what you need to do is check every line in the source file to see if it has a potential match against every line of the destination file.
##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data
##bindresult.txt
##--------------
##a website line
##this is data
##and more data
from difflib import SequenceMatcher

sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()

destfile = open('bindresult.txt', 'a+')
destlines = destfile.readlines()

has_matches = {k: False for k in sourcelines}

for d_line in destlines:
    for s_line in sourcelines:
        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break

for k in has_matches:
    if not has_matches[k]:
        destfile.write(k)

destfile.close()
This will add the line "radically different thing" to the destination file.
Upvotes: 0
Reputation: 8558
Your code works fine for me. It prints destline and sourceline to stdout when lines are similar (in the example I used, exactly the same), but it only wrote the unique lines to the file once. You might need to set your ratio threshold lower for your specific "similarity" needs.
Upvotes: 0