user3264316

Reputation: 332

nltk: word_tokenize changes quotes

I'm using Python's nltk and I want to tokenize a sentence containing quotes, but it turns " into `` and ''.

E.g:

>>> from nltk import word_tokenize

>>> sentence = 'He said "hey Bill!"'
>>> word_tokenize(sentence)
['He', 'said', '``', 'hey', 'Bill', '!', "''"]

Why doesn't it keep the quotes like in the original sentence and how can this be solved?

Thanks

Upvotes: 3

Views: 1111

Answers (2)

Kaushik Acharya

Reputation: 1590

Expanding on the answer provided by Leb:

The original URL for the Penn Treebank tokenization page is no longer available, but the content is still present at ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html

Copy-pasting the content here:

Treebank tokenization

    Our tokenization is fairly simple:
  
  • most punctuation is split from adjoining words

  • double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')

  • verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.

    Examples
         children's --> children 's
         parents' --> parents '
         won't --> wo n't
         gonna --> gon na
         I'm --> I 'm
    

    This tokenization allows us to analyze each component separately, so (for example) "I" can be in the subject Noun Phrase while "'m" is the head of the main verb phrase.

  • There are some subtleties for hyphens vs. dashes, ellipsis dots (...) and so on, but these often depend on the particular corpus or application of the tagged data.

  • In parsed corpora, bracket-like characters are converted to special 3-letter sequences, to avoid confusion with parse brackets. Some POS taggers, such as Adwait Ratnaparkhi's MXPOST, require this form for their input. In other words, these tokens in POS files: ( ) [ ] { } become, in parsed files: -LRB- -RRB- -LSB- -RSB- -LCB- -RCB- (The acronyms stand for (Left|Right) (Round|Square|Curly) Bracket.)

    Here is a simple sed script that does a decent enough job on most corpora, once the corpus has been formatted into one-sentence-per-line.
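For instance, the contraction and genitive splitting described above can be seen directly with NLTK's default word_tokenize. The output below is what recent NLTK versions typically produce; exact results may vary slightly between releases:

>>> from nltk import word_tokenize
>>> word_tokenize("I'm sure the children's parents won't mind")
['I', "'m", 'sure', 'the', 'children', "'s", 'parents', 'wo', "n't", 'mind']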

Example from Stanford:

https://nlp.stanford.edu/software/tokenizer.shtml

The command-line usage section there shows an example of how double quotes are changed according to the Penn Treebank tokenization rules.

https://www.nltk.org/_modules/nltk/tokenize/treebank.html

The TreebankWordTokenizer class shows how these changes are implemented:

# starting quotes
(re.compile(r"^\""), r"``")

# ending quotes
(re.compile(r'"'), " '' ")

Upvotes: 0

Leb

Reputation: 15953

It's actually meant to do that, not by accident. From Penn Treebank Tokenization:

double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')

Previous versions didn't do that, but it was updated last year. In other words, if you want to change this behavior you'll need to edit treebank.py.
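If you'd rather not edit treebank.py, a simple alternative (just a sketch) is to map the Treebank quote tokens back to plain double quotes after tokenizing:

>>> from nltk import word_tokenize
>>> sentence = 'He said "hey Bill!"'
>>> ['"' if token in ('``', "''") else token for token in word_tokenize(sentence)]
['He', 'said', '"', 'hey', 'Bill', '!', '"']

Note that this loses the distinction between opening and closing quotes that the `` / '' convention encodes.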

Upvotes: 2
