Reputation: 11
I am using TensorFlow for the first time, following a tutorial on featurization with the new TensorFlow Recommenders package: https://www.tensorflow.org/recommenders/examples/featurization
I ran into trouble swapping out their dataset (MovieLens) for one based on the Kaggle wine data. The following code works as expected:
wine_title_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()
wine_title_lookup.adapt(np.unique(wine_train['title']))
print(f"Vocabulary: {wine_title_lookup.get_vocabulary()[:3]}")
Vocabulary: ['', '[UNK]', 'Žitavské Vinice Rhine Riesling']
wine_title_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=wine_title_lookup.vocab_size(),
    output_dim=32
)
x = wine_title_lookup(["Susana Balbo Signature Malbec"])
x = wine_title_embedding(x)
x
<tf.Tensor: shape=(1, 32), dtype=float32, numpy= array([[-0.03861505, -0.02146437, 0.04332292, -0.02598745, 0.03842534, -0.01066433, 0.0292404 , 0.02783312, 0.03364438, 0.00054752, -0.0295071 , 0.03200008, 0.01224083, -0.00100452, -0.04346857, 0.00105418, -0.01640136, -0.01778026, 0.00171928, 0.03215903, 0.00020416, -0.02083766, -0.00323264, 0.02582215, 0.04805436, 0.0325211 , 0.0100181 , -0.04965406, 0.02548517, 0.01569786, 0.03761304, 0.01659941]], dtype=float32)>
However, the following produces an error:
wine_title_model = tf.keras.Sequential([wine_title_lookup, wine_title_embedding])
wine_title_model(["Susana Balbo Signature Malbec"])
AttributeError                            Traceback (most recent call last)
 in ()
----> 1 wine_title_model(["Susana Balbo Signature Malbec"])

3 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in call(self, *args, **kwargs)
    983
    984       with ops.enable_auto_cast_variables(self._compute_dtype_object):
--> 985         outputs = call_fn(inputs, *args, **kwargs)
    986
    987     if self._activity_regularizer:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/sequential.py in call(self, inputs, training, mask)
    370       if not self.built:
    371         self._init_graph_network(self.inputs, self.outputs)
--> 372       return super(Sequential, self).call(inputs, training=training, mask=mask)
    373
    374     outputs = inputs  # handle the corner case where self.layers is empty

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py in call(self, inputs, training, mask)
    384     """
    385     return self._run_internal_graph(
--> 386         inputs, training=training, mask=mask)
    387
    388   def compute_output_shape(self, input_shape):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py in _run_internal_graph(self, inputs, training, mask)
    482       masks = self._flatten_to_reference_inputs(mask)
    483     for input_t, mask in zip(inputs, masks):
--> 484       input_t._keras_mask = mask
    485
    486     # Dictionary mapping reference tensors to computed tensors.

AttributeError: 'str' object has no attribute '_keras_mask'
The Google code I based my script on uses a data format I am unfamiliar with, which allows them to run map on their data. I tried converting my data into some TensorFlow formats but could not replicate their functionality. However, this is the only step that is different, and I cannot understand why the pieces of the Sequential model work individually but not as a whole.
I looked at other examples of this error on SO but could not find a solution to my problem. This is what the raw data looks like:
wine_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108655 entries, 0 to 120727
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   country              108600 non-null  object
 1   description          108652 non-null  object
 2   designation          77150 non-null   object
 3   points               108336 non-null  float64
 4   price                100871 non-null  float64
 5   province             108600 non-null  object
 6   region_1             108655 non-null  object
 7   region_2             42442 non-null   object
 8   title                108655 non-null  object
 9   variety              108655 non-null  object
 10  winery               108655 non-null  object
 11  designation_replace  108655 non-null  object
 12  user_id              108655 non-null  int64
 13  price_isna           108655 non-null  bool
 14  price_imputed        108650 non-null  float64
 15  wine_id              108655 non-null  int64
dtypes: bool(1), float64(3), int64(2), object(10)
memory usage: 13.4+ MB
Upvotes: 1
Views: 2727
Reputation: 11
I solved this problem by fixing the following areas of my code.
When I posted this question I had already tried running tf.data.Dataset.from_tensor_slices directly on my pandas DataFrame, but that does not work. Instead, convert the DataFrame to a dictionary of NumPy arrays like so: wine_features_dict = {name: np.array(value) for name, value in wine_train.items()}
and then everything runs smoothly.
I thought that had fixed everything, but the error only went away once I also dropped all rows with any missing data. If this is happening to you, make sure that there is no missing data and that every column is either numeric or string.
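The two fixes above can be sketched on a toy table. This is only an illustration under assumptions: the miniature `wine_train` below is a hypothetical stand-in for the real Kaggle data, and `wine_clean` is a name introduced here. The likely reason missing values matter is that a text column with gaps mixes strings and NaN floats, which cannot be converted to a single tensor dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for wine_train; column names follow the post.
wine_train = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "country": ["Portugal", None, "Argentina"],
    "price": [15.0, 22.0, np.nan],
    "wine_id": [1, 2, 3],
})

# Drop every row that has at least one missing value.
wine_clean = wine_train.dropna()

# One NumPy array per column, keyed by column name. A dict like this is
# what gets passed to tf.data.Dataset.from_tensor_slices in the post.
wine_features_dict = {name: np.array(value) for name, value in wine_clean.items()}

for name, value in wine_features_dict.items():
    print(f"{name:10s} dtype={value.dtype} shape={value.shape}")
```

Dropping rows is the bluntest option. Imputing values, as the price_imputed column suggests, would keep more data.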
Edited 13 Oct 2021:
Adding the full solution below - as requested
wine_features_dict = {name: np.array(value) for name, value in wine_train.items()}
import itertools
def slices(features):
    for i in itertools.count():
        # For each feature take index `i`
        example = {name: values[i] for name, values in features.items()}
        yield example
for example in slices(wine_features_dict):
    for name, value in example.items():
        print(f"{name:19s}: {value}")
    break
Create features_ds:
features_ds = tf.data.Dataset.from_tensor_slices(wine_features_dict)
for example in features_ds:
    for name, value in example.items():
        print(f"{name:19s}: {value}")
    break
which yields
country : b'Portugal'
description : b"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016."
points : 87.0
price : 15.0
province : b'Douro'
title : b'Quinta dos Avidagos 2011 Avidagos Red (Douro)'
variety : b'Portuguese Red'
winery : b'Quinta dos Avidagos'
designation_replace: b'Avidagos'
user_id : b'15'
price_isna : False
price_imputed : 15.0
wine_id : 1
Between this and the actual model definition there are a number of steps too lengthy for a Stack Overflow post. Essentially, we create embeddings for categorical variables and normalize or discretize continuous features. I also added timestamps, which had to be preprocessed as well. Text gets word embeddings, and then everything is combined:
wine titles, user_ids -> vocabularies -> embeddings
price, price imputed -> normalize or discretize (bucket) -> embed buckets
words -> tokenization (splitting into constituent words or word-pieces), followed by vocabulary learning, followed by an embedding.
timestamps -> bucket based on max and min -> embed buckets
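As a rough illustration of the first two pipelines, here is a NumPy-only sketch of what the lookup, normalization and bucketing steps compute. The sample values and the names `index_of`, `boundaries` and `bucket_ids` are assumptions made for this example. In the actual model these steps are handled by the Keras StringLookup, Normalization and Discretization layers.

```python
import numpy as np

# Hypothetical sample values; the real columns come from wine_train.
titles = np.array(["Wine B", "Wine A", "Wine B", "Wine C"])
prices = np.array([15.0, 22.0, 48.0, 95.0])

# Categorical: vocabulary -> integer index, reserving 0 for unseen values.
vocabulary = np.unique(titles).tolist()
index_of = {title: i + 1 for i, title in enumerate(vocabulary)}
title_ids = np.array([index_of.get(t, 0) for t in ["Wine C", "Unknown wine"]])

# Continuous, option 1: normalize to zero mean and unit variance.
normalized = (prices - prices.mean()) / prices.std()

# Continuous, option 2: bucket between the min and the max. The ids run
# from 0 to len(boundaries), so an embedding table over these buckets
# needs at least len(boundaries) + 1 rows.
boundaries = np.linspace(prices.min(), prices.max(), num=4)
bucket_ids = np.digitize(prices, boundaries)

print(title_ids)
print(normalized)
print(bucket_ids)
```

The integer ids from the lookup and from the bucketing are what an embedding layer would then turn into dense vectors.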
inputs = {}
for name, column in wine_train.items():
    dtype = column.dtype
    if dtype == object:
        dtype = tf.string
    else:
        dtype = tf.float32
    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)
And then the model:
class UserModel(tf.keras.Model):
    def __init__(self, use_timestamps=False, use_country_origin=False):
        super().__init__()
        self._use_timestamps = use_timestamps
        self._use_country_origin = use_country_origin
        self.user_embedding = tf.keras.Sequential([
            tf.keras.layers.experimental.preprocessing.StringLookup(
                vocabulary=unique_user_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
        ])
        '''
        # Can also do this if user_id_lookup is defined
        self.user_embedding = tf.keras.Sequential([
            user_id_lookup,
            tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32),
        ])
        '''
        if use_country_origin:
            self.country_embedding = tf.keras.Sequential([
                tf.keras.layers.experimental.preprocessing.StringLookup(
                    vocabulary=unique_countries, mask_token=None),
                tf.keras.layers.Embedding(len(unique_countries) + 1, 32),
            ])
        if use_timestamps:
            self.timestamp_embedding = tf.keras.Sequential([
                tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
                tf.keras.layers.Embedding(len(timestamp_buckets) + 2, 32)
            ])
            self.normalized_timestamp = tf.keras.layers.experimental.preprocessing.Normalization()
            self.normalized_timestamp.adapt(timestamps)

    def call(self, inputs):
        # If timestamps not active just do the user_id embedding
        if not self._use_timestamps:
            # Ignore country of origin if is not enabled
            if not self._use_country_origin:
                return self.user_embedding(inputs['user_id'])
            return tf.concat([
                self.user_embedding(inputs["user_id"]),
                self.country_embedding(inputs["country"]),
            ], axis=1)
        # Take the input dictionary, pass it through each input layer,
        # and concatenate the result.
        if not self._use_country_origin:
            return tf.concat([
                self.user_embedding(inputs["user_id"]),
                self.timestamp_embedding(inputs["timestamp"]),
                self.normalized_timestamp(inputs["timestamp"]),
            ], axis=1)
        return tf.concat([
            self.user_embedding(inputs["user_id"]),
            self.timestamp_embedding(inputs["timestamp"]),
            self.normalized_timestamp(inputs["timestamp"]),
            self.country_embedding(inputs["country"]),
        ], axis=1)
We can then call it as well:
user_model = UserModel()
# Delete quotes if timestamps are available
'''user_model.normalized_timestamp.adapt(
    ratings.map(lambda x: x["timestamp"]).batch(128))
'''
for row in features_ds.batch(1).take(1):
    print(f"Computed representations: {user_model(row)[0, :3]}")
Upvotes: 0
Reputation: 21
Try using
wine_title_model.predict(["Susana Balbo Signature Malbec"])
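Another option, sketched here under assumptions, is to pass a tensor rather than a plain Python list. The traceback suggests that the Sequential model, once built as a graph, tries to set `_keras_mask` on each input and fails because the list elements are `str` objects. The sketch assumes a newer TensorFlow (2.6 or later), where the layer is available as `tf.keras.layers.StringLookup` and `vocabulary_size()` replaces `vocab_size()`. The three-title vocabulary and the explicit string `Input` are additions for illustration, not part of the original code.

```python
import tensorflow as tf

# Hypothetical vocabulary standing in for np.unique(wine_train['title']).
titles = [
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    "Susana Balbo Signature Malbec",
    "Žitavské Vinice Rhine Riesling",
]

lookup = tf.keras.layers.StringLookup(vocabulary=titles)
embedding = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(), output_dim=32)

# Declaring a string-typed input makes the expected dtype explicit.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(), dtype="string"),
    lookup,
    embedding,
])

# Wrap the raw strings in a tensor instead of passing a Python list.
query = tf.constant(["Susana Balbo Signature Malbec"])
vector = model(query)
print(vector.shape)
```

With the TensorFlow version from the question, the same idea would apply to the layers under tf.keras.layers.experimental.preprocessing, though that combination has not been checked here.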
Upvotes: 2