Reputation: 3123
All the answers I have read on Stack Overflow for similar errors suggest either fixing null values or fixing the datatypes. But I have neither nulls nor floats in my dataframe, and the error still persists.
Here's some information about my data:
About null values (as far as I know, numpy.nans are encoded as floats in pandas): the null counts come back zero for every column.
About data types: no column is float; the text columns all hold strings.
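The checks looked roughly like this (a sketch of what I ran; the expected results are in the comments):
# null check: both frames report zero nulls in every column
print(train_set.isnull().sum())
print(test_set.isnull().sum())
# dtype check: every column shows object (i.e. strings), no float64 anywhere
print(train_set.dtypes)
print(test_set.dtypes)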
And when I do:
from tensorflow.keras.preprocessing.text import Tokenizer
title_tokeniser = Tokenizer(num_words=10)
title_tokeniser.fit_on_texts(train_set.loc[:,'title'] + test_set.loc[:,'title'])
This is the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-38-26b704f1c0a1> in <module>()
1 title_tokeniser = Tokenizer(num_words=10)
----> 2 title_tokeniser.fit_on_texts(train_set.loc[:,'title'] + test_set.loc[:,'title'])
3
4 # unique tokens found in titles are:
5 title_token_index = title_tokeniser.word_index
1 frames
/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in fit_on_texts(self, texts)
223 self.filters,
224 self.lower,
--> 225 self.split)
226 for w in seq:
227 if w in self.word_counts:
/usr/local/lib/python3.6/dist-packages/keras_preprocessing/text.py in text_to_word_sequence(text, filters, lower, split)
41 """
42 if lower:
---> 43 text = text.lower()
44
45 if sys.version_info < (3,):
AttributeError: 'float' object has no attribute 'lower'
EDIT:
Here's what I have done in the process of debugging and trying to fix this problem:
I even made sure that I had removed all numbers by doing:
import re
text = re.sub(r'[+-]?([0-9]*[.])?[0-9]+', ' ', text)
on each row of all columns of both my train and test sets.
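Concretely, the cleaning pass was along these lines (a sketch; it assumes every column already holds text):
import re

# strip signed integers and decimals from every cell of every column
number_pattern = r'[+-]?([0-9]*[.])?[0-9]+'
train_set = train_set.applymap(lambda cell: re.sub(number_pattern, ' ', cell))
test_set = test_set.applymap(lambda cell: re.sub(number_pattern, ' ', cell))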
I also checked whether turning off the lower argument changes anything, by initialising a Tokenizer instance with:
title_tokeniser = Tokenizer(num_words=10, lower=None)
But then the error becomes:
AttributeError: 'float' object has no attribute 'translate'
I couldn't trace the presence of any floats or nulls in my data. How do I fix this?
Upvotes: 1
Views: 5036
Reputation: 2086
Try this: concatenate the two title columns instead of adding them, and cast everything to string:
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

# stack the two columns end to end (instead of adding them) and force
# every entry to a string, so the Tokenizer never sees a float
texts = pd.concat([train_set['title'], test_set['title']], axis=0).astype("str")

title_tokeniser = Tokenizer(num_words=10)
title_tokeniser.fit_on_texts(texts)
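Why this works: the + in your original call is element-wise Series addition, which aligns on the index. Wherever the train and test indexes don't overlap (typical after a train/test split), the sum is NaN, which is a float, and that is what the Tokenizer ends up calling .lower() (or .translate()) on. A toy sketch of the effect, with made-up data:
import pandas as pd

a = pd.Series(['red car', 'blue car'], index=[0, 1])
b = pd.Series(['green car'], index=[2])
print(a + b)
# 0    NaN
# 1    NaN
# 2    NaN
# dtype: object   <- every unmatched index becomes a float NaN
pd.concat stacks the values instead of aligning them, and .astype("str") guards against any stray non-string entries.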
Upvotes: 5