Reputation: 141
I have a string of the form:
'I am going to visit "Huge Hotel" and the "Grand River"'
I want it tokenized as
['I', 'am', 'going',..., 'Huge Hotel','and' ,'the' ,'Grand River']
As shown, 'Huge Hotel' and 'Grand River' should each be treated as a single token because they appear inside quotes.
import nltk
text = 'I am going to visit "Huge Hotel" and the "Grand River"'
b = nltk.word_tokenize(text)
I have written the above code, but it doesn't work.
Upvotes: 3
Views: 947
Reputation: 122112
It looks odd but it works:

1. re.findall('"([^"]*)"', s): Find all substrings enclosed in double quotes.
2. phrase.replace(' ', '_'): Replace all spaces with underscores in the substrings from Step 1.
3. word_tokenize() on the modified string.

[out]:
>>> import re
>>> from nltk import word_tokenize
>>> s = 'I am going to visit "Huge Hotel" and the "Grand River"'
>>> for phrase in re.findall('"([^"]*)"', s):
... s = s.replace('"{}"'.format(phrase), phrase.replace(' ', '_'))
...
>>> s
'I am going to visit Huge_Hotel and the Grand_River'
>>> word_tokenize(s)
['I', 'am', 'going', 'to', 'visit', 'Huge_Hotel', 'and', 'the', 'Grand_River']
I'm sure there's a simpler regex operation that can replace the series of regex + string operations.
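As a sketch of one such simpler approach (not from the answer above, so treat it as a suggestion): a single regex with an alternation can extract quoted phrases and bare words in one pass, which also keeps the spaces inside the quoted phrases rather than turning them into underscores:

```python
import re

def tokenize_keep_quoted(s):
    # Each match is a tuple (quoted, word): exactly one group is non-empty,
    # depending on which side of the alternation matched.
    return [quoted if quoted else word
            for quoted, word in re.findall(r'"([^"]*)"|(\S+)', s)]

s = 'I am going to visit "Huge Hotel" and the "Grand River"'
print(tokenize_keep_quoted(s))
# ['I', 'am', 'going', 'to', 'visit', 'Huge Hotel', 'and', 'the', 'Grand River']
```

Note this splits only on whitespace, so it won't handle punctuation the way nltk.word_tokenize does; it is a sketch for the quoted-phrase case specifically.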
Upvotes: 1