catchingPatterns

Reputation: 98

How to use nltk sentence tokenizer in case of bulleted-data or listed data?

I am using the nltk sentence tokenizer to extract sentences from files.
But it fails badly when the text contains bullets or other listed data.

Example text

Code I am using is:

dataFile = open(inputFile, 'r')
fileContent = dataFile.read()
fileContent = re.sub("\n+", " ", fileContent)
sentences = nltk.sent_tokenize(fileContent)
print(sentences)

I want the sentence tokenizer to give each bullet as a sentence.

Can someone please help me here? Thanks!

Edit1:
Raw ppt sample: http://pastebin.com/dbwKCESg
Processed ppt data: http://pastebin.com/0N64krKC

I will receive only the processed data file and need to run sentence tokenization on it.

Upvotes: 2

Views: 1750

Answers (1)

alex9311

Reputation: 1420

Your question is a bit unclear, but I tried your code and it does seem to fail when parsing the bullets. I've added a step that strips non-printable characters and a find/replace that turns newlines into periods. The printable characters (string.printable) in my Python version are:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

This code creates sentences out of your bullets, while still separating sentences out of the blocks of text. It would fail if sentences in the input text had newlines in the middle of them - which your example input does not.

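import re
import string

import nltk

# inputFile is the path to the processed text file, as in the question
with open(inputFile, 'r') as dataFile:
    fileContent = dataFile.read()

# End every line with a period (plus a space, so the next line starts a new token)
fileContent = re.sub(r"\n+", ". ", fileContent)
# Drop anything outside string.printable, such as bullet glyphs
fileContentAscii = ''.join(filter(lambda x: x in string.printable, fileContent))
sentences = nltk.sent_tokenize(fileContentAscii)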
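
A different way to get one sentence per bullet is to split the text into lines first and only then run the tokenizer on each line, instead of rewriting the newlines. The sketch below is an illustration, not a tested solution for the pastebin data: the function name `bullets_to_sentences`, the `BULLET_PREFIX` pattern, and the `split_sentences` parameter are all made up for this example, and it assumes the processed file still has one bullet per line. Like the code above, it would break sentences that wrap across lines.

```python
import re

# Leading bullet glyphs or list numbering such as "•", "-", "*", "1." or "2)".
# This pattern is an assumption about the input, not derived from the question's data.
BULLET_PREFIX = re.compile(r"^\s*(?:[•◦▪‣·*\-–]+|\d+[.)])\s+")


def bullets_to_sentences(text, split_sentences=None):
    """Return a flat list of sentences, treating every non-empty line as its own unit."""
    if split_sentences is None:
        # Imported lazily so the line handling can be tried without NLTK installed.
        import nltk
        split_sentences = nltk.sent_tokenize

    sentences = []
    for line in text.splitlines():
        # Remove the bullet marker, if any, and surrounding whitespace.
        line = BULLET_PREFIX.sub("", line).strip()
        if not line:
            continue  # skip blank lines between bullets
        # A line may still hold several sentences (a paragraph), so split it further.
        sentences.extend(split_sentences(line))
    return sentences
```

Calling `bullets_to_sentences(fileContent)` on the raw file content (before any `re.sub` on newlines) uses `nltk.sent_tokenize` by default. Passing a different callable as `split_sentences` swaps in another splitter, which is mainly useful for checking the line handling on its own.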

Upvotes: 2
