Hellboy

Reputation: 1239

Find new inserted words in text file

Using Python, I want to find the words that were newly inserted into a text file. For example:

Old: He is a new employee here.
New: He was a new, employee there.

I want this list of words as output: ['was', ',', 'there']

I tried difflib, but it reports the diff in an awkward format using '+', '-' and '?' markers, so I would have to parse its output to find the new words. Is there an easier way to do this in Python?

Upvotes: 0

Views: 51

Answers (2)

Hellboy

Reputation: 1239

I used Google's diff-match-patch library. It works fine.

Upvotes: 0

Jordan McQueen

Reputation: 787

You can accomplish this with the re module.

import re

# compile a pattern that matches either a whole word or a comma
regex = re.compile(r'\b\w+\b|,')

# the inputs
old = "He is a new employee here."
new = "He was a new, employee there."

# split each sentence into a list of words (and commas)
old_words = regex.findall(old)
new_words = regex.findall(new)

# collect every token of new_words that has no counterpart in old_words;
# each match consumes one occurrence from old_words, so a word that already
# existed but now appears an extra time is also reported
word_differences = []
for word in new_words:
    if word in old_words:
        old_words.remove(word)
    else:
        word_differences.append(word)

# print it out to verify
print(word_differences)

Note that if you want to catch other punctuation, such as an exclamation mark or a semicolon, you must add it to the regular expression. As written, it only matches words and commas.
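If the position of the words matters, or you want punctuation in general rather than just commas, one possible variation is to let difflib compare token lists instead of raw text. The following is only a sketch under some assumptions: the function name inserted_tokens and the TOKEN pattern are made up for illustration, and "replaced" tokens are counted as new, which may or may not be what you want.

```python
import re
from difflib import SequenceMatcher

# Hypothetical tokenizer: a run of word characters, or any single
# character that is neither a word character nor whitespace
# (so '!', ';' and '.' become tokens too).
TOKEN = re.compile(r"\w+|[^\w\s]")

def inserted_tokens(old, new):
    """Return the tokens of `new` that difflib reports as inserted or replaced."""
    old_tokens = TOKEN.findall(old)
    new_tokens = TOKEN.findall(new)
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'insert' means tokens only present in new; 'replace' means tokens
        # of old were swapped for different tokens in new
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
    return added

print(inserted_tokens("He is a new employee here.",
                      "He was a new, employee there."))
```

Because get_opcodes() returns structured tuples, there is no need to parse the '+', '-' and '?' text output. Unlike the loop above, this compares the sequences in order, so a word that merely moved to a different place may be reported as inserted.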

Upvotes: 0
