Reputation: 21
I have a huge document with many repeated sentences (for example footer text and hyperlinks containing alphanumeric characters), and I need to get rid of the repeated hyperlinks and footer text. I tried the code below, but it did not work. Please review and help.
corpus = "We use file handling methods in python to remove duplicate lines in python text file or function. The text file or function has to be in the same directory as the python program file. Following code is one way of removing duplicates in a text file bar.txt and the output is stored in foo.txt. These files should be in the same directory as the python script file, else it won’t work.Now, we should crop our big image to extract small images with amounts.In terms of topic modelling, the composites are documents and the parts are words and/or phrases (phrases n words in length are referred to as n-grams).We use file handling methods in python to remove duplicate lines in python text file or function.As an example I will use some image of a bill, saved in the pdf format. From this bill I want to extract some amounts.All our wrappers, except of textract, can’t work with the pdf format, so we should transform our pdf file to the image (jpg). We will use wand for this.Now, we should crop our big image to extract small images with amounts."
from nltk.tokenize import sent_tokenize
sentences_with_dups = []
for sentence in corpus:
    words = sentence.sent_tokenize(corpus)
    if len(set(words)) != len(words):
        sentences_with_dups.append(sentence)
        print(sentences_with_dups)
    else:
        print('No duplciates found')
Error message for the above code:
AttributeError: 'str' object has no attribute 'sent_tokenize'
Desired output:
Duplicates = ['We use file handling methods in python to remove duplicate lines in python text file or function.','Now, we should crop our big image to extract small images with amounts.']
Cleaned_corpus = {removed duplicates from corpus}
Upvotes: 2
Views: 2600
Reputation: 9018
First of all, the example you provided is missing the space between the period that ends one sentence and the start of the next in many places, so I cleaned that up first.
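If the text is too long to fix by hand, the cleanup can be roughly automated. The sketch below is only one possible approach, not what I actually ran: the helper name add_missing_spaces is made up for illustration, and the regex assumes that a period, exclamation mark, or question mark directly followed by a capital letter marks a missing space. That assumption will also split things like "U.S.A", so check the result on your own data.

```python
import re

def add_missing_spaces(text):
    # Hypothetical helper: insert a space wherever '.', '!' or '?'
    # is immediately followed by an uppercase ASCII letter.
    return re.sub(r'(?<=[.!?])(?=[A-Z])', ' ', text)

sample = "else it won’t work.Now, we should crop our big image."
fixed = add_missing_spaces(sample)
print(fixed)
```

Text that already has a space after the punctuation is left unchanged, so running the helper twice is harmless.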
Then you can do:
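from nltk.tokenize import sent_tokenize
corpus = "......"
sentences = sent_tokenize(corpus)
# Sentences that occur more than once, each listed a single time
duplicates = list(set([s for s in sentences if sentences.count(s) > 1]))
# Every distinct sentence, once each
cleaned = list(set(sentences))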
The above does not preserve the original order, because sets are unordered. If the order matters, you can do the following instead:
duplicates = []
cleaned = []
for s in sentences:
    if s in cleaned:
        if s not in duplicates:
            duplicates.append(s)
    else:
        cleaned.append(s)
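For a really large document, the "s in cleaned" list lookups above get slow, since each one scans the whole list. A possible variant, sketched here under my own assumptions, keeps sets alongside the lists so each lookup is constant time. The function name split_duplicates and the sample sentences are made up for illustration; in practice you would pass in the output of sent_tokenize.

```python
def split_duplicates(sentences):
    """Return (cleaned, duplicates), both in first-occurrence order."""
    seen = set()       # sentences encountered so far
    reported = set()   # sentences already recorded as duplicates
    cleaned, duplicates = [], []
    for s in sentences:
        key = s.strip()  # ignore leading/trailing whitespace when comparing
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
        elif key not in reported:
            reported.add(key)
            duplicates.append(key)
    return cleaned, duplicates

# Stand-in data; replace with sent_tokenize(corpus) in real use.
sentences = ["A one.", "B two.", "A one.", "C three.", "B two.", "A one."]
cleaned, duplicates = split_duplicates(sentences)

# Rebuild the corpus text from the de-duplicated sentences.
cleaned_corpus = " ".join(cleaned)
print(duplicates)
print(cleaned_corpus)
```

Joining with a single space is a simplification: it gives you a cleaned corpus string, but any original line breaks or extra spacing between sentences are not kept.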
Upvotes: 3