Reputation: 5
I have a corpus of text with around 300k sentences. I want to keep only unique sentences: if a sentence occurs twice, only one copy of it should remain.
This is what I tried in python 3:
def unique_sentences(data):
    u_sent = list(set(data.split('.')))
    return ".".join(u_sent)
The problem is that it also removes unique sentences. Do you know another way to do this in Python?
Upvotes: 0
Views: 441
Reputation: 300
I suggest splitting the text into sentences with a well-established library such as NLTK. I got the following results when I ran your code on a sample text:
Input: 'This is an example. It is another one. This is the third one. This is an example. This is an example.'
Output: .This is an example. This is the third one. This is an example. It is another one
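A minimal sketch of why the naive split misbehaves: str.split('.') keeps the leading space on every sentence after the first and produces an empty trailing string, so the set treats near-duplicate strings as distinct and the rejoined text loses its order.

```python
data = 'This is an example. It is another one. This is an example.'

# split('.') keeps leading spaces and yields a trailing empty string,
# so 'This is an example' and ' This is an example' are different set members
parts = data.split('.')
print(parts)
# ['This is an example', ' It is another one', ' This is an example', '']

# 4 distinct strings, even though the text has only 2 unique sentences
print(len(set(parts)))  # 4
```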
But I got the desired result when I used the NLTK library to split the sentences with the following code:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
unique_sentences = set(sent_tokenize(data))
Output: {'It is another one.', 'This is the third one.', 'This is an example.'}
Furthermore, if you care about the order of the sentences, you can use the following method to get unique sentences:
from collections import OrderedDict
unique_ordered = list(OrderedDict.fromkeys(sent_tokenize(data)))
output = ' '.join(unique_ordered)
Output: This is an example. It is another one. This is the third one.
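As a side note, on Python 3.7+ a plain dict also preserves insertion order, so OrderedDict is not strictly needed. A minimal sketch of the deduplication step, assuming the sentences have already been tokenized (e.g. by sent_tokenize):

```python
# sentences as sent_tokenize would return them for the sample text
sentences = ['This is an example.', 'It is another one.',
             'This is the third one.', 'This is an example.']

# dict.fromkeys keeps the first occurrence of each key, in insertion order
unique_ordered = list(dict.fromkeys(sentences))
print(' '.join(unique_ordered))
# This is an example. It is another one. This is the third one.
```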
Upvotes: 1