user3264316

Reputation: 332

nltk: word_tokenize changes quotes

I'm using Python's nltk and I want to tokenize a sentence containing quotes, but it turns " into `` and ''.

E.g:

>>> from nltk import word_tokenize

>>> sentence = 'He said "hey Bill!"'
>>> word_tokenize(sentence)
['He', 'said', '``', 'hey', 'Bill', '!', "''"]

Why doesn't it keep the quotes like in the original sentence and how can this be solved?

Thanks

Upvotes: 3

Views: 1111

Answers (2)

Kaushik Acharya

Reputation: 1590

Expanding on the answer provided by Leb:

The original URL for the Penn Treebank tokenization page is no longer available, but the content is still present at ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html

Copy-pasting the content here:

Treebank tokenization

    Our tokenization is fairly simple:
  
  • most punctuation is split from adjoining words

  • double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')

  • verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.

    Examples
         children's --> children 's
         parents' --> parents '
         won't --> wo n't
         gonna --> gon na
         I'm --> I 'm
    

    This tokenization allows us to analyze each component separately, so (for example) "I" can be in the subject Noun Phrase while "'m" is the head of the main verb phrase.

  • There are some subtleties for hyphens vs. dashes, ellipsis dots (...) and so on, but these often depend on the particular corpus or application of the tagged data.

  • In parsed corpora, bracket-like characters are converted to special 3-letter sequences, to avoid confusion with parse brackets. Some POS taggers, such as Adwait Ratnaparkhi's MXPOST, require this form for their input. In other words, these tokens in POS files: ( ) [ ] { } become, in parsed files: -LRB- -RRB- -LSB- -RSB- -LCB- -RCB- (The acronyms stand for (Left|Right) (Round|Square|Curly) Bracket.)

    Here is a simple sed script that does a decent enough job on most corpora, once the corpus has been formatted into one-sentence-per-line.
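For instance, the contraction and genitive splitting described above can be seen directly with NLTK's default word_tokenize. The output below is what recent NLTK versions typically produce; exact results may vary slightly between releases:

>>> from nltk import word_tokenize
>>> word_tokenize("I'm sure the children's parents won't mind")
['I', "'m", 'sure', 'the', 'children', "'s", 'parents', 'wo', "n't", 'mind']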

Example from Stanford:

https://nlp.stanford.edu/software/tokenizer.shtml

The command-line usage section there shows an example of how double quotes are changed according to the Penn Treebank tokenization rules.

https://www.nltk.org/_modules/nltk/tokenize/treebank.html

The TreebankWordTokenizer class shows how these changes are implemented:

# starting quotes
(re.compile(r"^\""), r"``")

# ending quotes
(re.compile(r'"'), " '' ")

Upvotes: 0

Leb

Reputation: 15953

It's actually meant to do that, not by accident. From Penn Treebank Tokenization:

double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')

Previous versions didn't do that, but it was updated last year. In other words, if you want to change this behavior you'll need to edit treebank.py.
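If you'd rather not edit treebank.py, a simple alternative (just a sketch) is to map the Treebank quote tokens back to plain double quotes after tokenizing:

>>> from nltk import word_tokenize
>>> sentence = 'He said "hey Bill!"'
>>> ['"' if token in ('``', "''") else token for token in word_tokenize(sentence)]
['He', 'said', '"', 'hey', 'Bill', '!', '"']

Note that this loses the distinction between opening and closing quotes that the `` / '' convention encodes.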

Upvotes: 2
