Reputation: 1945
I'm currently working with a dataset that contains more than 10,000 news articles, and I want to delete the sentences that contain only one word. I have looked into nltk and textcleaner, but I wasn't able to delete the one-word sentences with either.
For example:
Input: I want to delete sentence with one word. Okay. Fine.Let's do it.
Output: I want to delete sentence with one word. Let's do it.
The code is:
import textcleaner as tc
import nltk
import numpy as np
datafile = np.genfromtxt("f12filtered.txt", encoding='utf-8', delimiter=".")
data = tc.document(datafile)
data.remove_stpwrds()
Upvotes: 0
Views: 1675
Reputation: 464
The data can be split into a list of sentences using '.' as the delimiter. Any sentence that contains only one word can then be dropped. After this, data is a list; you can join it if you want to work with the complete text, or use it as is. You can do this using the following code:
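# Assumes data is one string holding the full text
sentences = data.split('.')
# split() with no argument ignores leading/extra whitespace, so ' Okay' counts as one word
data = [sent for sent in sentences if len(sent.split()) >= 2]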
To join data to form a single string:
data = ('.').join(data)
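If you also want to keep the sentence-ending punctuation and handle a missing space after a period (as in "Fine.Let's"), a regular expression may work better than a plain split. The following is only a minimal sketch using the standard library: the function name drop_one_word_sentences is made up for illustration, and it assumes every '.', '!' or '?' ends a sentence, so abbreviations and decimal numbers would be split incorrectly.

```python
import re

def drop_one_word_sentences(text):
    """Return text with every one-word sentence removed (illustrative helper)."""
    # Each match is a run of non-terminator characters plus any trailing . ! ?
    sentences = re.findall(r'[^.!?]+[.!?]*', text)
    kept = []
    for sent in sentences:
        sent = sent.strip()
        # split() with no argument counts whitespace-separated words
        if len(sent.split()) >= 2:
            kept.append(sent)
    # Rejoin with a single space so the output reads as normal prose
    return ' '.join(kept)

text = "I want to delete sentence with one word. Okay. Fine.Let's do it."
print(drop_one_word_sentences(text))
```

On the example from the question this should print the expected output. For real news text, a proper sentence tokenizer is likely to be more reliable than this pattern.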
Upvotes: 2