new recruit 21

Reputation: 111

Python Document Comparison - returning ALL words NOT IN other document

I'm trying to create a "translation comparison" program that reads two documents and returns ALL the words in one document that aren't in the other. Right now, my program only returns the first word it finds that is in one file but not in the other. This is for a beginner class, so I'm trying to avoid obscure internal methods, even if that means less efficient code. This is what I have so far...

def translation_comparison():
   import re
   file1 = open("Desktop/file1.txt","r")
   file2 = open("Desktop/file2.txt","r")
   text1 = file1.read()
   text2 = file2.read()
   text1 = re.findall(r'\w+',text1)
   text2 = re.findall(r'\w+',text2)
   for item in text2:
       if item not in text1:
           return item  

Upvotes: 2

Views: 62

Answers (3)

user4679058


While Jason Brooks's answer is perfect, I think you can also have a look at this. It uses the basic set() type and its difference() method, so it doesn't require an explicit loop.

def translation_comparison():
    import re
    file1 = open("text1.txt","r")
    file2 = open("text2.txt","r")
    text1 = file1.read()
    text2 = file2.read()
    text1 = set(re.findall(r'\w+',text1))
    text2 = set(re.findall(r'\w+',text2))
    return list(text1.difference(text2))

set().difference() is a basic method, so I'd guess it doesn't count as an "obscure internal method".
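To see what difference() gives back without touching any files, here is a minimal sketch on two in-memory strings. The function name words_only_in_first and the sample sentences are made up for illustration; sorted() is used only because sets have no guaranteed order.

```python
import re

def words_only_in_first(text_a, text_b):
    """Return the unique words found in text_a but not in text_b."""
    words_a = set(re.findall(r'\w+', text_a))
    words_b = set(re.findall(r'\w+', text_b))
    # difference() keeps the elements of words_a that are absent from words_b.
    return sorted(words_a.difference(words_b))

print(words_only_in_first("the cat sat on the mat", "the dog sat"))
# prints ['cat', 'mat', 'on']
```

Note that the direction matters: swapping the two arguments would return ['dog'] instead.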

Upvotes: 3

Jason B

Reputation: 7465

You can do something like this:

def translation_comparison():
   import re
   file1 = open("text1.txt","r")
   file2 = open("text2.txt","r")
   text1 = file1.read()
   text2 = file2.read()
   text1 = re.findall(r'\w+',text1)
   text2 = re.findall(r'\w+',text2)
   # added lines below
   text1 = list(set(text1))
   text2 = list(set(text2))
   for word in text2:
      if word in text1:
         text1.remove(word)
   return text1

Take a look starting at my comment. We first convert each list of words to a set and back to a list, which leaves only the unique words in case there are duplicates. Next, we loop through each word in the second text, and if that word also exists in the first text, we remove it from the first text's list. At the end, text1 holds only the words that are not also in text2, and we return that list.
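If you would rather stay close to the loop in the question, a variation on the same idea is to collect matches in a list instead of returning inside the loop, since a return ends the function at the first hit. This is only a sketch: the function name words_missing_from and the sample strings are hypothetical, and it works on strings rather than files to keep it short.

```python
import re

def words_missing_from(text_a, text_b):
    """List each word of text_a that never appears in text_b, in first-seen order."""
    words_a = re.findall(r'\w+', text_a)
    words_b = re.findall(r'\w+', text_b)
    missing = []
    for word in words_a:
        # Append instead of returning, so the loop keeps going after the first hit.
        # The second check avoids listing the same word twice.
        if word not in words_b and word not in missing:
            missing.append(word)
    return missing

print(words_missing_from("red green blue green", "green yellow"))
# prints ['red', 'blue']
```

Unlike the set-based versions, this keeps the words in the order they first appear in the document.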

Let me know if this makes sense, or if you have any questions.

Edit: As per the suggestion from @blckknght, a much simpler way to do this is to use set subtraction as follows:

def translation_comparison():
   import re
   file1 = open("text1.txt","r")
   file2 = open("text2.txt","r")
   text1 = file1.read()
   text2 = file2.read()
   text1 = re.findall(r'\w+',text1)
   text2 = re.findall(r'\w+',text2)
   return list(set(text1) - set(text2))

Also note that this treats the same word capitalized differently (e.g. The vs. the) as separate words. That is simple to fix with a basic list comprehension applied before taking the sets: text1 = [x.lower() for x in text1] and text2 = [x.lower() for x in text2].
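Putting the lowercasing and the set subtraction together might look like the sketch below. The helper names lowercase_words and compare_files, and passing the file paths as parameters, are my own choices for illustration; the with statement is there so both files get closed after reading.

```python
import re

def lowercase_words(text):
    """Return the set of words in text, lowercased so 'The' and 'the' match."""
    return {word.lower() for word in re.findall(r'\w+', text)}

def compare_files(path1, path2):
    """Return words that appear in the first file but not in the second."""
    with open(path1, "r") as file1:
        words1 = lowercase_words(file1.read())
    with open(path2, "r") as file2:
        words2 = lowercase_words(file2.read())
    # Sorting is optional; it just makes the result order predictable.
    return sorted(words1 - words2)
```

Calling compare_files("text1.txt", "text2.txt") would then ignore differences in capitalization between the two documents.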

Upvotes: 1

user4642224

Reputation: 167

Take care with capitalized words. For example, "Foo" and "foo" will be treated as two different words when in fact they are the same. The code will view this as a non-match and will return the word as a difference.

Upvotes: 1
