Reputation: 1239
Using Python, I want to find the new words that were inserted into a text file. For example:
Old: He is a new employee here.
New: He was a new, employee there.
I want this list of words as output: ['was', ',', 'there']
I used difflib, but it gives me the diff in a poorly formatted way using '+', '-' and '?'. I would have to parse the output to find the new words. Is there an easy way to get this done in Python?
Upvotes: 0
Views: 51
Reputation: 787
You can accomplish this with the re module.
import re

# create a regular expression object that matches a whole word or a comma
regex = re.compile(r'(?:\b\w{1,}\b)|,')

# the inputs
old = "He is a new employee here."
new = "He was a new, employee there."

# create lists of the words (or commas) in each sentence
old_words = regex.findall(old)
new_words = regex.findall(new)

# collect each word from new_words that has no match left in old_words;
# a matched word is removed from old_words, so a word that already existed
# but appears an extra time in the new text is also reported
word_differences = []
for word in new_words:
    if word in old_words:
        old_words.remove(word)
    else:
        word_differences.append(word)

# print it out to verify
print(word_differences)
Note that if you want to detect other punctuation, such as an exclamation mark or a semicolon, you must add it to the regular expression. Right now, it only matches words and commas.
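As a rough sketch of that extension, the same loop can be wrapped in a function that takes the punctuation marks to track as an argument. The function name `inserted_tokens` and the `punctuation` parameter are made up for this illustration, and `\w+` is used here as a shorter pattern for a whole word.

```python
import re


def inserted_tokens(old, new, punctuation=",;!"):
    """Return the tokens of `new` that have no unused match in `old`.

    `punctuation` is a hypothetical parameter: each character in it is
    treated as a token of its own, alongside whole words.
    """
    # Match either a run of word characters or one listed punctuation mark.
    # re.escape keeps characters such as ']' or '-' safe inside the class.
    pattern = re.compile(r"\w+|[" + re.escape(punctuation) + "]")

    remaining = pattern.findall(old)
    added = []
    for token in pattern.findall(new):
        if token in remaining:
            # Consume the match so a repeated word is reported the second time.
            remaining.remove(token)
        else:
            added.append(token)
    return added


print(inserted_tokens("He is a new employee here.",
                      "He was a new, employee there!"))
# ['was', ',', 'there', '!']
```

The period is not reported because it is not in the default `punctuation` string; pass something like `punctuation=",;!."` if it should be tracked too.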
Upvotes: 0