Reputation: 30605
I have a text file containing the following text:
Table of Contents
Preface 1
Chapter 1: Tokenizing Text and WordNet Basics 7
Tokenizing text into sentences 8
Tokenizing sentences into words 10
Tokenizing sentences using regular expressions 12
Suppose the string I have is:
input = "Tokenzing sentence using expressions"
I thought of using the beginning and ending words to extract the line, but there are a lot of repetitions.
So what's the best way to get this output?
Tokenizing sentences using regular expressions
Upvotes: 1
Views: 1418
Reputation: 16942
If you are prepared to preprocess your chapter headings to strip page numbers and similar clutter, difflib from the standard library can do the fuzzy matching. This:
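import difflib

# Chapter headings with page numbers already stripped
contents = ["Tokenizing Text and WordNet Basics",
            "Tokenizing text into sentences",
            "Tokenizing sentences into words",
            "Tokenizing sentences using regular expressions"]

# Misspelled, partial heading to look up (named query to avoid shadowing the built-in input())
query = "Tokenzing sentence using expressions"

# n=1 returns only the single best match
print(difflib.get_close_matches(query, contents, n=1))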
will give you this output:
['Tokenizing sentences using regular expressions']
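For the preprocessing step, here is a minimal sketch. It assumes every heading ends with a page number and may start with a "Chapter N:" prefix, as in the sample from the question. The helper name clean_heading and the list raw_lines are made up for illustration; in practice you would read the lines from your file instead.

```python
import difflib
import re

def clean_heading(line):
    """Strip a trailing page number and a leading 'Chapter N:' prefix (hypothetical helper)."""
    line = re.sub(r"\s+\d+\s*$", "", line)          # drop trailing page number, e.g. " 12"
    line = re.sub(r"^Chapter\s+\d+:\s*", "", line)  # drop prefix, e.g. "Chapter 1: "
    return line.strip()

# Lines as they appear in the question's text file (assumed to be already read in)
raw_lines = [
    "Table of Contents",
    "Preface 1",
    "Chapter 1: Tokenizing Text and WordNet Basics 7",
    "Tokenizing text into sentences 8",
    "Tokenizing sentences into words 10",
    "Tokenizing sentences using regular expressions 12",
]

contents = [clean_heading(line) for line in raw_lines]

query = "Tokenzing sentence using expressions"
matches = difflib.get_close_matches(query, contents, n=1)
print(matches)  # ['Tokenizing sentences using regular expressions']
```

If your real headings follow a different layout (roman numerals, dotted leaders and so on), the two regular expressions would need adjusting.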
Upvotes: 4