Saul_goodman

Reputation: 141

Tokenizing a string containing double quotes

I have a string of the form:

'I am going to visit "Huge Hotel" and the "Grand River"'

I want it tokenized as

['I', 'am', 'going',..., 'Huge Hotel','and' ,'the' ,'Grand River']

As shown, 'Huge Hotel' and 'Grand River' should each be treated as a single token because they appear inside double quotes.

import nltk
text = 'I am going to visit "Huge Hotel" and the "Grand River"'
b = nltk.word_tokenize(text)

I have written the above code, but it doesn't work: it splits the quoted phrases into separate tokens.

Upvotes: 3

Views: 947

Answers (1)

alvas

Reputation: 122112

It looks odd, but it works:

  1. re.findall('"([^"]*)"', s): Find all substrings enclosed in double quotes.
  2. phrase.replace(' ', '_'): Replace all spaces with underscores in each substring from Step 1.
  3. Replace each quote-enclosed substring in the original string with its underscored version from Step 2.
  4. Use word_tokenize() on the modified string.

[out]:

>>> import re
>>> from nltk import word_tokenize
>>> s = 'I am going to visit "Huge Hotel" and the "Grand River"'
>>> for phrase in re.findall('"([^"]*)"', s):
...     s = s.replace('"{}"'.format(phrase), phrase.replace(' ', '_'))
... 
>>> s
'I am going to visit Huge_Hotel and the Grand_River'
>>> word_tokenize(s)
['I', 'am', 'going', 'to', 'visit', 'Huge_Hotel', 'and', 'the', 'Grand_River']
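
Note that the multi-word tokens come back with underscores rather than spaces. If the question's expected output is wanted exactly, one extra pass over the token list can undo the substitution, assuming the original text contains no underscores of its own:

>>> [token.replace('_', ' ') for token in word_tokenize(s)]
['I', 'am', 'going', 'to', 'visit', 'Huge Hotel', 'and', 'the', 'Grand River']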

I'm sure there's a simpler regex operation that can replace the series of regex + string operations.
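
For example (just a sketch, not using NLTK), a single pattern that matches either a double-quoted phrase or a run of non-space characters tokenizes the string in one pass and keeps the spaces inside the quoted phrases; note that, unlike word_tokenize(), it does not split off punctuation:

>>> import re
>>> s = 'I am going to visit "Huge Hotel" and the "Grand River"'
>>> [m.group(1) if m.group(1) is not None else m.group(0)
...  for m in re.finditer(r'"([^"]*)"|\S+', s)]
['I', 'am', 'going', 'to', 'visit', 'Huge Hotel', 'and', 'the', 'Grand River']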

Upvotes: 1
