DanielS

Reputation: 33

Include punctuation in keras tokenizer

Is there any way to include punctuation in the Keras Tokenizer?
I would like to have a transformation...

FROM

Tomorrow will be cold.

TO

Index-tomorrow, Index-will,...,Index-point

How can I achieve that?

Upvotes: 3

Views: 5473

Answers (2)

FabioL

Reputation: 999

A more general solution, inspired by the one proposed by lmartens, uses a regular expression to put spaces around a whole set of punctuation marks. Here is the code:

from keras.preprocessing.text import Tokenizer
import re

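to_tokenize = '.,:;!?'  # punctuation marks to keep as separate tokens
# default filters minus every mark listed in to_tokenize
to_exclude = '"#$%&()*+-/<=>@[\\]^_`{|}~\t\n'
t = Tokenizer(filters=to_exclude)
text = "Tomorrow, will be. cold?"
# surround each kept mark with spaces so it is split off as its own token
text = re.sub(r'([' + re.escape(to_tokenize) + r'])', r' \1 ', text)
t.fit_on_texts([text])
print(t.word_index) # {'tomorrow': 1, ',': 2, 'will': 3, 'be': 4, '.': 5, 'cold': 6, '?': 7}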

Upvotes: 2

lmartens

Reputation: 1502

This is possible if you do some pre-processing on the text.

First, make sure that the punctuation is not filtered out by the Tokenizer. The documentation shows that the Tokenizer takes a filters argument on initialization. Replace its default value with the set of characters you want filtered out, leaving out the ones you want to keep in your index.
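If you would rather not retype the filter string by hand, it can be derived from the default. This is only a sketch in plain Python: DEFAULT_FILTERS is the documented default of the filters argument copied into a constant, and filters_without is a hypothetical helper name, not something Keras provides.

```python
# Default value of the Tokenizer `filters` argument, copied here as a constant.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def filters_without(keep, base=DEFAULT_FILTERS):
    """Return `base` with every character listed in `keep` removed.

    Hypothetical helper for illustration; it is not part of Keras.
    """
    return "".join(ch for ch in base if ch not in keep)


# Keep the full stop in the vocabulary, filter everything else as usual.
filters = filters_without(".")
print(repr(filters))
```

The resulting string can then be passed as Tokenizer(filters=filters).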

The second part is making sure that each punctuation mark is recognized as its own token. If you tokenize the example sentence as is, the result contains "cold." as a single token instead of "cold" and ".". What you need is a separator between the word and the punctuation. A naive approach is to replace each punctuation mark in the text with a space followed by that mark.
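To see why the separator matters, here is a rough plain-Python imitation of the splitting step (lowercase, then split on spaces). It is an approximation for illustration and does not call Keras itself.

```python
text = "Tomorrow will be cold."

# Without a separator the full stop stays glued to the last word.
glued = text.lower().split(" ")

# With a space inserted before the full stop it becomes its own token.
separated = text.replace(".", " .").lower().split(" ")

print(glued)      # ['tomorrow', 'will', 'be', 'cold.']
print(separated)  # ['tomorrow', 'will', 'be', 'cold', '.']
```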

The following code does what you ask:

from keras.preprocessing.text import Tokenizer

t = Tokenizer(filters='!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n') # all without .
text = "Tomorrow will be cold."
text = text.replace(".", " .")
t.fit_on_texts([text])
print(t.word_index)

-> prints: {'will': 2, 'be': 3, 'cold': 4, 'tomorrow': 1, '.': 5}

The replace logic can be done in a smarter way (e.g. with a regex if you want to capture all punctuation), but you get the gist.
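One possible regex version of that replace step, as a sketch only: pad_punctuation is a hypothetical helper name, and using string.punctuation as the default set of marks is my assumption, not something the Tokenizer requires. Note that this set includes the apostrophe, so "don't" would be split into three tokens; pass a narrower marks string if that is not what you want.

```python
import re
import string


def pad_punctuation(text, marks=string.punctuation):
    """Put a space on both sides of every character found in `marks`.

    Hypothetical helper for illustration; it is not part of Keras.
    """
    # re.escape makes characters such as ']' or '^' safe inside the class.
    pattern = "([" + re.escape(marks) + "])"
    return re.sub(pattern, r" \1 ", text)


padded = pad_punctuation("Tomorrow will be cold.")
print(padded.lower().split())  # ['tomorrow', 'will', 'be', 'cold', '.']
```

The padded text still has to go through a Tokenizer whose filters no longer contain the marks you padded, otherwise they are removed again.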

Upvotes: 10
