Reputation: 5
I have a corpus of text with around 300k sentences. I want to keep only unique sentences: if a sentence occurs twice, only one copy of it should remain.
This is what I tried in python 3:
def unique_sentences(data):
    u_sent = list(set(data.split('.')))
    return ".".join(u_sent)
The problem is that it also removes unique sentences. Do you know another way to do this in Python?
Upvotes: 0
Views: 441
Reputation: 300
I suggest splitting the text into sentences with a well-established library such as NLTK. I got the following results when I ran your code on a sample text:
Input: 'This is an example. It is another one. This is the third one. This is an example. This is an example.'
Output: .This is an example. This is the third one. This is an example. It is another one
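A minimal sketch of why the naive split misbehaves: str.split('.') keeps the leading space on every sentence after the first and produces an empty trailing string, so the set treats near-duplicate strings as distinct and the rejoined text loses its order.

```python
data = 'This is an example. It is another one. This is an example.'

# split('.') keeps leading spaces and yields a trailing empty string,
# so 'This is an example' and ' This is an example' are different set members
parts = data.split('.')
print(parts)
# ['This is an example', ' It is another one', ' This is an example', '']

# 4 distinct strings, even though the text has only 2 unique sentences
print(len(set(parts)))  # 4
```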
But I got the desired result when I used the NLTK library to split the sentences with the following code:
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
unique_sentences = set(sent_tokenize(data))
Output: {'It is another one.', 'This is the third one.', 'This is an example.'}
Furthermore, if you care about the order of the sentences, you can use the following method to get unique sentences:
from collections import OrderedDict
unique_ordered = list(OrderedDict.fromkeys(sent_tokenize(data)))
output = ' '.join(unique_ordered)
Output: This is an example. It is another one. This is the third one.
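As a side note, on Python 3.7+ a plain dict also preserves insertion order, so OrderedDict is not strictly needed. A minimal sketch of the deduplication step, assuming the sentences have already been tokenized (e.g. by sent_tokenize):

```python
# sentences as sent_tokenize would return them for the sample text
sentences = ['This is an example.', 'It is another one.',
             'This is the third one.', 'This is an example.']

# dict.fromkeys keeps the first occurrence of each key, in insertion order
unique_ordered = list(dict.fromkeys(sentences))
print(' '.join(unique_ordered))
# This is an example. It is another one. This is the third one.
```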
Upvotes: 1