Reputation: 2353
I have multiple files with different structures that I would like to tokenize.
For example, file 1:
event_name, event_location, event_description, event_priority
file 2:
event_name, event_participants, event_location, event_description, event_priority
and so on. I would like to create an array with data from all the files and then tokenize it. Unfortunately, when I run tokenizer.fit_on_texts() in a loop, the dictionary does not expand but is overwritten. I have to use the tokenizer in the loop because I need to pad event_description.
My code:
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=50000, oov_token="<OOV>")
for file in files:
    print("Loading : ", file)
    events = pd.read_csv(file)
    # prepare columns
    events['event_name'] = 'XXBOS XXEN ' + events['event_name'].astype(str)
    events['event_location'] = 'XXEL ' + events['event_location'].astype(str)
    events['event_description'] = 'XXED ' + events['event_description'].astype(str)
    events['event_priority'] = 'XXEP ' + events['event_priority'].astype(str) + ' XXEOS'
    # Tokenize concatenated columns into one
    tokenizer.fit_on_texts(np.concatenate((events['event_name'], events['event_location'], events['event_description'], events['event_priority']), axis=0))
# Later I run texts_to_sequences on each column so that I can run pad_sequences on it, and then I concatenate the columns again
When I check tokenizer.word_index, the indices of tokens like XXBOS change between loop iterations. Is it possible to keep filling the word_index dictionary instead of overwriting it?
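For reference, the later padding step I mention looks roughly like this (just a sketch; the maxlen value is only a placeholder):
from tensorflow.keras.preprocessing.sequence import pad_sequences

# sequence one column, then pad it to a fixed length (maxlen=50 is a placeholder)
desc_seqs = tokenizer.texts_to_sequences(events['event_description'])
desc_padded = pad_sequences(desc_seqs, maxlen=50, padding='post')
# ... the other columns are handled the same way and then concatenated back together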
EDIT: experiment:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(filters='.!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n', num_words=50000, oov_token="<OOV>", split=' ', char_level=False)
text = "some text for test"
tokenizer.word_index
{'<OOV>': 1, 't': 2, 'e': 3, 's': 4, 'o': 5, 'm': 6, 'x': 7, 'f': 8, 'r': 9}
and adding another text to fit:
text2="new sentence with unknown chars xxasqeew"
tokenizer.fit_on_texts(text2)
tokenizer.word_index
{'<OOV>': 1, 'e': 2, 't': 3, 'n': 4, 's': 5, 'w': 6, 'o': 7, 'x': 8, 'r': 9, 'c': 10, 'h': 11, 'a': 12, 'm': 13, 'f': 14, 'i': 15, 'u': 16, 'k': 17, 'q': 18}
The indexes in the tokenizer changed completely.
Upvotes: 2
Views: 596
Reputation: 11364
The dict is not overwritten, it is updated. The ordering of the words changes after each iteration because fit_on_texts
sorts the word index by each word's number of occurrences (e.g. the most common word gets index "1", the second most common gets index "2", etc.; index "0" is reserved).
An example:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
text1 = ["aaa bbb ccc"]
tokenizer.fit_on_texts(text1)
print("1. iteration", tokenizer.word_index)
text2 = ["bbb ccc ddd"]
tokenizer.fit_on_texts(text2)
print("2. iteration", tokenizer.word_index)
text3 = ["ccc ddd eee"]
tokenizer.fit_on_texts(text3)
print("3. iteration", tokenizer.word_index)
# "ccc" occurs three times
# "bbb" occurs twice
# "ddd" occurs twice
# "aaa" occurs once
# "eee" occurs once
# The actual output:
# 1. iteration {'aaa': 1, 'bbb': 2, 'ccc': 3}
# 2. iteration {'bbb': 1, 'ccc': 2, 'aaa': 3, 'ddd': 4}
# 3. iteration {'ccc': 1, 'bbb': 2, 'ddd': 3, 'aaa': 4, 'eee': 5}
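A quick way to confirm that the counts accumulate rather than reset is to look at tokenizer.word_counts after the three fits above (expected output shown as a comment):
# the raw counts keep growing across fit_on_texts calls
print(dict(tokenizer.word_counts))
# {'aaa': 1, 'bbb': 2, 'ccc': 3, 'ddd': 2, 'eee': 1}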
Upvotes: 1
Reputation: 2744
Just store your events, and then tokenize all the events at once:
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer():
    return Tokenizer(num_words=50000, oov_token="<OOV>")

all_events = []
files_to_tokens_dict = {}
for file in files:
    print("Loading : ", file)
    events = pd.read_csv(file)
    # prepare columns
    events['event_name'] = 'XXBOS XXEN ' + events['event_name'].astype(str)
    events['event_location'] = 'XXEL ' + events['event_location'].astype(str)
    events['event_description'] = 'XXED ' + events['event_description'].astype(str)
    events['event_priority'] = 'XXEP ' + events['event_priority'].astype(str) + ' XXEOS'
    # Collect the columns so they can all be tokenized at once later
    all_events.append(events['event_name'])
    all_events.append(events['event_location'])
    all_events.append(events['event_description'])
    all_events.append(events['event_priority'])
    # Per-file tokenizer, only used to record which tokens occur in this file
    tokenizer = create_tokenizer()
    tokenizer.fit_on_texts(pd.concat([events['event_name'], events['event_location'], events['event_description'], events['event_priority']]))
    tokens_in_current_file = tokenizer.word_index.keys()
    files_to_tokens_dict[file] = tokens_in_current_file

# One global tokenizer fitted on all events from all files
global_tokenizer = create_tokenizer()
global_tokenizer.fit_on_texts(pd.concat(all_events))
global_tokenizer.word_index  # one word index with all tokens
Then if you want to retrieve the indices of the tokens:
def get_token_indices(file):
    tokens_in_file = files_to_tokens_dict[file]
    result = []
    for token in tokens_in_file:
        global_token_index = global_tokenizer.word_index[token]
        result.append(global_token_index)
    return result
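A quick usage sketch (assuming the same files list as above):
# global indices of all tokens that appear in the first file
print(get_token_indices(files[0]))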
Upvotes: 0