Reputation: 1679
I have a text dataset that is a list of lists of lists of strings. I need to tokenize
this data to fit it into a classification model. I am very familiar with keras.preprocessing.text.Tokenizer
and usually do this with code like the following:
data =
[[['not'],
['ahead'],
['um let me think'],
['thats not very encouraging if they had a cast of thousands on the other end']],
[['okay civil liberties tell me your position'],
['probably would go ahead']],
[['oh'],
['it up so i dont know where you really go'],
['well most of my problem with this latest task'],
['its some i kind of dont want to put in the time to do it'],
['right so im saying ive got a lot of other things to do']]]
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)
When I run this code on my data, I get the following error:
<ipython-input-44-1da804f42cc8> in main()
12 # tokenize and vectorize text data to prepare for embedding
13 tokenizer = Tokenizer()
---> 14 tokenizer.fit_on_texts(new_corpus)
15 sequences = tokenizer.texts_to_sequences(new_corpus)
16 word_index = tokenizer.word_index
/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in fit_on_texts(self, texts)
213 if self.lower:
214 if isinstance(text, list):
--> 215 text = [text_elem.lower() for text_elem in text]
216 else:
217 text = text.lower()
/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in <listcomp>(.0)
213 if self.lower:
214 if isinstance(text, list):
--> 215 text = [text_elem.lower() for text_elem in text]
216 else:
217 text = text.lower()
AttributeError: 'list' object has no attribute 'lower'
This makes sense to me, because Tokenizer expects each text to be a string but is being given a list here. Normally, I would flatten my list structure before passing it through the tokenizer.
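For context, the flattening I would normally do looks something like this (just a sketch; itertools.chain is one of several ways to collapse the two outer levels):

from itertools import chain

# collapse the two outer levels: list of lists of lists -> flat list of strings
flat_texts = list(chain.from_iterable(chain.from_iterable(data)))
tokenizer = Tokenizer()
tokenizer.fit_on_texts(flat_texts)  # works, but the nesting is gone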
However, I cannot do this because my nested list structure is essential for my modeling.
How, then, can I tokenize my data while preserving the list structure? I want to treat the whole thing as one corpus and get unique integer word tokens across all of the lists.
It should look something like this (tokenization done by hand here, so pardon any typos):
data =
[[[0],
  [1],
  [2, 3, 4, 5],
  [6, 0, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]],
 [[20, 21, 22, 23, 3, 24, 25],
  [26, 27, 28, 29]],
 [[30],
  [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
  [41, 42, 43, 44, 45, 46, 47, 48, 49],
  [50, 51, 34, 52, 14, 35, 53, 54, 55, 56, 17, 57, 58, 59, 31],
  [60, 61, 62, 63, 64, 65, 12, 66, 14, 67, 68, 59, 31]]]
Upvotes: 1
Views: 2135
Reputation: 11333
You can fit the tokenizer on a flattened view of the texts to build a single vocabulary, then walk the original nested structure to do the indexing, which preserves the nesting:
# build the vocabulary from a flat view of the texts;
# each innermost list holds exactly one string, hence y[0]
tok_data = [y[0] for x in data for y in x]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tok_data)

# index the texts while walking the original nested structure
sequences = []
for x in data:
    tmp = []
    for y in x:
        tmp.append(tokenizer.texts_to_sequences(y)[0])
    sequences.append(tmp)
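Here texts_to_sequences is called on each innermost list y (a one-element list of strings), and [0] unwraps the single result. Running this on the sample data gives a nested list of integer sequences with the same shape as data; note that Keras assigns word indices starting from 1, reserving 0 for padding. A quick sanity check on the sample data:

# the nesting is preserved: 3 sublists with 4, 2 and 5 utterances
print(len(sequences), [len(s) for s in sequences])  # -> 3 [4, 2, 5]
print(sequences[0][0])  # a one-element list: the integer index of 'not'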
Upvotes: 2