Reputation: 98
I am using the NLTK sentence tokenizer to extract sentences from files.
But it fails badly when the text contains bullets or other listed data.
The code I am using is:
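import re
import nltk

# inputFile is the path to the processed text file
with open(inputFile, 'r') as dataFile:
    fileContent = dataFile.read()
# Collapse every run of newlines into a single space
fileContent = re.sub(r"\n+", " ", fileContent)
sentences = nltk.sent_tokenize(fileContent)
print(sentences)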
I want the sentence tokenizer to return each bullet as its own sentence.
Can someone please help me here? Thanks!
Edit1:
Raw ppt sample: http://pastebin.com/dbwKCESg
Processed ppt data: http://pastebin.com/0N64krKC
I will receive only the processed data file and need to run sentence tokenization on it.
Upvotes: 2
Views: 1750
Reputation: 1420
Your question is a bit unclear, but I tried your code and it does seem to fail on the bullets. I've added a step that strips non-printable characters and a find/replace that turns newlines into periods. On my Python version, the printable characters (string.printable) are:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
This code turns each of your bullets into a sentence while still splitting the blocks of text into sentences. It would fail if a sentence in the input were broken across lines by a newline, which is not the case in your example input.
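import re, nltk, string
# inputFile is the path to the processed text file
with open(inputFile, 'r') as dataFile:
    fileContent = dataFile.read()
# Turn each run of newlines into a sentence boundary; the trailing space
# keeps the period from fusing with the first word of the next line
fileContent = re.sub(r"\n+", ". ", fileContent)
# Drop bullet glyphs and any other character outside string.printable
fileContentAscii = ''.join(filter(lambda x: x in string.printable, fileContent))
sentences = nltk.sent_tokenize(fileContentAscii)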
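If you want to check the preprocessing on its own before handing the text to NLTK, here is a rough sketch of the same idea as a standalone function. The name prepare_for_tokenizer and the sample text are made up for illustration, and it is a small variation rather than the exact code above: it works line by line and only appends a period when a line does not already end with ".", "!" or "?", so ordinary sentences do not end up with a doubled period.

```python
import re
import string

# Characters that survive the cleanup (same set as string.printable)
PRINTABLE = set(string.printable)

def prepare_for_tokenizer(text):
    """Return text with every non-empty line ending in sentence punctuation.

    Hypothetical helper: strips characters outside string.printable
    (such as bullet glyphs) and joins the lines with single spaces.
    """
    chunks = []
    for line in text.splitlines():
        # Remove bullet glyphs and other non-printable characters
        cleaned = "".join(ch for ch in line if ch in PRINTABLE).strip()
        if not cleaned:
            continue  # skip blank lines
        # Add a period only if the line does not already end a sentence
        if cleaned[-1] not in ".!?":
            cleaned += "."
        chunks.append(cleaned)
    return " ".join(chunks)

# Made-up sample: a title, two bullets, a blank line and a paragraph
sample = (
    "Agenda\n"
    "\u2022 First point\n"
    "\u2022 Second point\n"
    "\n"
    "This is a paragraph. It has two sentences."
)
print(prepare_for_tokenizer(sample))
```

The returned string can then be passed to nltk.sent_tokenize as before. The same caveat applies: a sentence that is broken across two lines would still be cut in the middle.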
Upvotes: 2