Reputation: 111
I'm trying to create a "translation comparison" program that reads and compares two documents and then returns ALL words in one document that aren't in the other document. Right now, my program only returns the first instance of a word in 'file1' not being in 'file2'. This is for a beginner class, so I'm trying to avoid using obscure internal methods, even if that means less efficient code. This is what I have so far...
def translation_comparison():
    import re
    file1 = open("Desktop/file1.txt","r")
    file2 = open("Desktop/file2.txt","r")
    text1 = file1.read()
    text2 = file2.read()
    text1 = re.findall(r'\w+',text1)
    text2 = re.findall(r'\w+',text2)
    for item in text2:
        if item not in text1:
            return item
Upvotes: 2
Views: 62
Reputation:
While Jason Brooks's answer is perfect, you can have a look at this one as well. It uses the basic features of set() and doesn't require an explicit loop.
def translation_comparison():
    import re
    file1 = open("text1.txt","r")
    file2 = open("text2.txt","r")
    text1 = file1.read()
    text2 = file2.read()
    text1 = set(re.findall(r'\w+',text1))
    text2 = set(re.findall(r'\w+',text2))
    return list(text1.difference(text2))
set().difference() is a basic method, so I guess it would not be considered an "obscure internal method".
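To try the set approach without creating any files, here is a minimal sketch that works on plain strings. The helper name words_only_in_first and the sample sentences are made up for illustration; the result is sorted only so the output order is predictable, since sets have no guaranteed order.

```python
import re

def words_only_in_first(first_text, second_text):
    # Hypothetical helper: takes the text directly instead of reading files
    first_words = set(re.findall(r'\w+', first_text))
    second_words = set(re.findall(r'\w+', second_text))
    # difference() keeps the words of first_words that are absent from second_words
    return sorted(first_words.difference(second_words))

print(words_only_in_first("the cat sat on the mat", "the dog sat on the rug"))
# prints ['cat', 'mat']
```

Swapping the two arguments gives the words that appear only in the second text.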
Upvotes: 3
Reputation: 7465
You can do something like this:
def translation_comparison():
    import re
    file1 = open("text1.txt","r")
    file2 = open("text2.txt","r")
    text1 = file1.read()
    text2 = file2.read()
    text1 = re.findall(r'\w+',text1)
    text2 = re.findall(r'\w+',text2)
    # added lines below
    text1 = list(set(text1))
    text2 = list(set(text2))
    for word in text2:
        if word in text1:
            text1.remove(word)
    return text1
Take a look at the code starting at my comment. We first convert the list of words from each document to a set and back to a list, which leaves us with only the unique words in case there are duplicates. Next, we loop through each word in the second text, and if that word also exists in the first text, we remove it from the first text's word list. At the end, text1 contains only the words that are not also in text2. We return that list, which contains all those words.
Let me know if this makes sense, or if you have any questions.
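If you would rather keep a plain loop like the one in the question, the key point is that a return inside the loop exits the function at the first match. Below is a rough sketch that collects the words into a list instead and returns once the loop has finished. The name missing_words and the sample strings are just for illustration, and it takes strings rather than file names so it can be run as is.

```python
import re

def missing_words(first_text, second_text):
    # Hypothetical helper: returns words of first_text that never occur in second_text
    first_words = re.findall(r'\w+', first_text)
    second_words = re.findall(r'\w+', second_text)
    missing = []
    for word in first_words:
        # Append instead of returning, and skip words we have already recorded
        if word not in second_words and word not in missing:
            missing.append(word)
    # Return only after every word has been checked
    return missing

print(missing_words("one two three two four", "two four five"))
# prints ['one', 'three']
```

Unlike the set versions, this keeps the words in the order they first appear in the first text, at the cost of slower lookups on long documents.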
Edit: As per the suggestion from @blckknght, a much simpler way to do this is to simply use set subtraction as follows:
def translation_comparison():
    import re
    file1 = open("text1.txt","r")
    file2 = open("text2.txt","r")
    text1 = file1.read()
    text2 = file2.read()
    text1 = re.findall(r'\w+',text1)
    text2 = re.findall(r'\w+',text2)
    return list(set(text1) - set(text2))
Also note that this treats the same word capitalized differently (e.g. The vs. the) as separate words. This is simple to fix with a basic list comprehension applied before the sets are built: text1 = [x.lower() for x in text1] and text2 = [x.lower() for x in text2].
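As a quick illustration of the lowercase fix combined with set subtraction, here is a small sketch on in-memory strings. The function name and sample text are assumptions for the example, not part of the original program.

```python
import re

def words_only_in_first_ignoring_case(first_text, second_text):
    # Lowercase every word first so "The" and "the" count as the same word
    first_words = [w.lower() for w in re.findall(r'\w+', first_text)]
    second_words = [w.lower() for w in re.findall(r'\w+', second_text)]
    # Set subtraction keeps words found only in the first text; sorted for stable output
    return sorted(set(first_words) - set(second_words))

print(words_only_in_first_ignoring_case("The quick Fox", "the slow fox"))
# prints ['quick']
```

Without the lowercase step, the same inputs would also report The and Fox as missing from the second text.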
Upvotes: 1
Reputation: 167
Take care with capitalized words. For example, "Foo" and "foo" will be treated as two different words when in fact they are the same. The code will view this as a non-match and will return
Upvotes: 1