Reputation: 332
I'm using Python's nltk and I want to tokenize a sentence containing quotes, but word_tokenize turns " into `` and ''. For example:
>>> from nltk import word_tokenize
>>> sentence = 'He said "hey Bill!"'
>>> word_tokenize(sentence)
['He', 'said', '``', 'hey', 'Bill', '!', "''"]
Why doesn't it keep the quotes as they appear in the original sentence, and how can I fix this?
Thanks
Upvotes: 3
Views: 1111
Reputation: 1590
Expanding on the answer provided by Leb:
The original URL for the Penn Treebank tokenization guidelines is no longer available, but the content is mirrored at ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html
Copy-pasting the content here:
Treebank tokenization
Our tokenization is fairly simple:
most punctuation is split from adjoining words
double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')
verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. Examples:
children's --> children 's
parents' --> parents '
won't --> wo n't
gonna --> gon na
I'm --> I 'm
This tokenization allows us to analyze each component separately, so (for example) "I" can be in the subject Noun Phrase while "'m" is the head of the main verb phrase.
There are some subtleties for hyphens vs. dashes, ellipsis dots (...) and so on, but these often depend on the particular corpus or application of the tagged data.
In parsed corpora, bracket-like characters are converted to special 3-letter sequences, to avoid confusion with parse brackets. Some POS taggers, such as Adwait Ratnaparkhi's MXPOST, require this form for their input. In other words, these tokens in POS files: ( ) [ ] { } become, in parsed files: -LRB- -RRB- -LSB- -RSB- -LCB- -RCB- (The acronyms stand for (Left|Right) (Round|Square|Curly) Bracket.)
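For illustration, the bracket conversion described above can be written as a small lookup table (the name PTB_BRACKETS is made up for this sketch; the mapping itself is the standard PTB one):

```python
# Standard PTB bracket-to-token mapping; non-bracket tokens pass through.
PTB_BRACKETS = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}

tokens = ["(", "see", "Fig.", "1", ")"]
print([PTB_BRACKETS.get(t, t) for t in tokens])
# → ['-LRB-', 'see', 'Fig.', '1', '-RRB-']
```

If I recall correctly, recent NLTK versions expose this conversion directly via a convert_parentheses argument to TreebankWordTokenizer.tokenize, so check your version before rolling your own.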
(The original page also includes a simple sed script that does a decent enough job on most corpora, once the corpus has been formatted into one-sentence-per-line.)
Example from Stanford:
https://nlp.stanford.edu/software/tokenizer.shtml
The command-line usage section shows an example of how double quotes are changed per the Penn Treebank tokenization rules.
https://www.nltk.org/_modules/nltk/tokenize/treebank.html
The TreebankWordTokenizer class shows how the changes are implemented:
# starting quotes
(re.compile(r"^\""), r"``")
# ending quotes
(re.compile(r'"'), " '' ")
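If you just want the original quote characters back, a simpler option than editing the tokenizer is to post-process the token list. A minimal sketch (note it cannot tell opening quotes from closing ones apart, since both become ", and it would also rewrite a literal `` or '' that appeared in the input):

```python
def restore_quotes(tokens):
    """Replace Treebank-style quote tokens (`` and '') with ASCII double quotes."""
    return ['"' if tok in ('``', "''") else tok for tok in tokens]

tokens = ['He', 'said', '``', 'hey', 'Bill', '!', "''"]
print(restore_quotes(tokens))
# → ['He', 'said', '"', 'hey', 'Bill', '!', '"']
```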
Upvotes: 0
Reputation: 15953
It's actually meant to do that; it's not an accident. From the Penn Treebank Tokenization guidelines:
double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')
In previous versions it didn't do that, but it was updated last year. In other words, if you want to change the behavior you'll need to edit treebank.py
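Rather than editing treebank.py in place, you can override the rule tables on a tokenizer instance. This is a sketch assuming NLTK 3.x, where the rules live in the STARTING_QUOTES and ENDING_QUOTES attributes of TreebankWordTokenizer; check nltk/tokenize/treebank.py if your version names them differently (it also drops the handling of non-ASCII quote characters that newer versions have):

```python
import re

from nltk.tokenize import TreebankWordTokenizer

tok = TreebankWordTokenizer()

# Keep " as its own token instead of rewriting it to `` .
tok.STARTING_QUOTES = [(re.compile(r'"'), r' " ')]

# Drop only the rule that turns " into '' ; keep the rules that
# split possessives and contractions ('s, n't, 'll, ...).
tok.ENDING_QUOTES = [
    rule for rule in TreebankWordTokenizer.ENDING_QUOTES
    if rule[0].pattern != r'"'
]

print(tok.tokenize('He said "hey Bill!"'))
# → ['He', 'said', '"', 'hey', 'Bill', '!', '"']
```

Note that TreebankWordTokenizer is used directly here: word_tokenize wraps sentence splitting around a word tokenizer, and depending on your NLTK version that may no longer be this exact class.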
Upvotes: 2