sahilmanekia

Reputation: 11

'str' object has no attribute '_keras_mask' error when using tf.keras.Sequential

Background

I am using TensorFlow for the first time, following a tutorial on featurization with the new TensorFlow Recommenders package: https://www.tensorflow.org/recommenders/examples/featurization

I ran into trouble swapping out their dataset (MovieLens) for one based on the Kaggle wine data. The following code works as expected:

wine_title_lookup= tf.keras.layers.experimental.preprocessing.StringLookup()
wine_title_lookup.adapt(np.unique(wine_train['title']))
print(f"Vocabulary: {wine_title_lookup.get_vocabulary()[:3]}")

Vocabulary: ['', '[UNK]', 'Žitavské Vinice Rhine Riesling']

wine_title_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=wine_title_lookup.vocab_size(),
    output_dim=32
)
x= wine_title_lookup(["Susana Balbo Signature Malbec"])

x= wine_title_embedding(x)

x

<tf.Tensor: shape=(1, 32), dtype=float32, numpy= array([[-0.03861505, -0.02146437, 0.04332292, -0.02598745, 0.03842534, -0.01066433, 0.0292404 , 0.02783312, 0.03364438, 0.00054752, -0.0295071 , 0.03200008, 0.01224083, -0.00100452, -0.04346857, 0.00105418, -0.01640136, -0.01778026, 0.00171928, 0.03215903, 0.00020416, -0.02083766, -0.00323264, 0.02582215, 0.04805436, 0.0325211 , 0.0100181 , -0.04965406, 0.02548517, 0.01569786, 0.03761304, 0.01659941]], dtype=float32)>

However, the following produces an error:

wine_title_model = tf.keras.Sequential([wine_title_lookup, wine_title_embedding])

wine_title_model(["Susana Balbo Signature Malbec"])

AttributeError                            Traceback (most recent call last)
in ()
----> 1 wine_title_model(["Susana Balbo Signature Malbec"])

3 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py in call(self, *args, **kwargs)
    983
    984       with ops.enable_auto_cast_variables(self._compute_dtype_object):
--> 985         outputs = call_fn(inputs, *args, **kwargs)
    986
    987       if self._activity_regularizer:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/sequential.py in call(self, inputs, training, mask)
    370       if not self.built:
    371         self._init_graph_network(self.inputs, self.outputs)
--> 372       return super(Sequential, self).call(inputs, training=training, mask=mask)
    373
    374     outputs = inputs  # handle the corner case where self.layers is empty

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py in call(self, inputs, training, mask)
    384     """
    385     return self._run_internal_graph(
--> 386         inputs, training=training, mask=mask)
    387
    388   def compute_output_shape(self, input_shape):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py in _run_internal_graph(self, inputs, training, mask)
    482       masks = self._flatten_to_reference_inputs(mask)
    483     for input_t, mask in zip(inputs, masks):
--> 484       input_t._keras_mask = mask
    485
    486     # Dictionary mapping reference tensors to computed tensors.

AttributeError: 'str' object has no attribute '_keras_mask'

Notable differences with the source material

The Google code I based my script on uses a data format I am unfamiliar with, which allows them to run map on their data. I tried converting my data into some TensorFlow formats but could not replicate their functionality. However, this is the only step that is different, and I cannot understand why the pieces of the Sequential model work individually but not as a whole.

I looked at other questions on SO where this error has come up but could not find a solution to my problem. This is what the raw data looks like:

wine_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 108655 entries, 0 to 120727
Data columns (total 16 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   country              108600 non-null  object 
 1   description          108652 non-null  object 
 2   designation          77150 non-null   object 
 3   points               108336 non-null  float64
 4   price                100871 non-null  float64
 5   province             108600 non-null  object 
 6   region_1             108655 non-null  object 
 7   region_2             42442 non-null   object 
 8   title                108655 non-null  object 
 9   variety              108655 non-null  object 
 10  winery               108655 non-null  object 
 11  designation_replace  108655 non-null  object 
 12  user_id              108655 non-null  int64  
 13  price_isna           108655 non-null  bool   
 14  price_imputed        108650 non-null  float64
 15  wine_id              108655 non-null  int64  
dtypes: bool(1), float64(3), int64(2), object(10)
memory usage: 13.4+ MB

Upvotes: 1

Views: 2727

Answers (2)

sahilmanekia

Reputation: 11

I solved this problem by fixing the following areas of my code:

  1. Converting wine_train to a TensorFlow format

When I posted this question I had already tried running tf.data.Dataset.from_tensor_slices directly on my pandas DataFrame, but that did not work. Instead, convert the DataFrame to a dictionary of NumPy arrays first: wine_features_dict = {name: np.array(value) for name, value in wine_train.items()} and then everything runs smoothly.

  2. TensorFlow is very sensitive to missing or NaN values.

I thought I had handled all of them, but only dropping every row with any missing data got rid of the error. If this is happening to you, make sure that there is no missing data and that every column is either numeric or string.
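
A minimal sketch of the two fixes above, assuming pandas and NumPy are available. The three-row DataFrame is a made-up stand-in for wine_train, and wine_clean is a name introduced only for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for wine_train (not the real Kaggle data).
wine_train = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "country": ["Portugal", None, "Italy"],
    "price": [15.0, 20.0, np.nan],
})

# Fix 2: drop every row that has at least one missing value.
wine_clean = wine_train.dropna()

# Sanity check: no NaN/None should remain in any column.
assert not wine_clean.isna().any().any()

# Fix 1: convert the DataFrame to a dict of NumPy arrays, one array per column.
wine_features_dict = {name: np.array(value) for name, value in wine_clean.items()}

for name, values in wine_features_dict.items():
    print(f"{name:10s}: {values}")
```

The resulting dictionary is what would then be passed to tf.data.Dataset.from_tensor_slices. Dropping rows is the bluntest option; imputing values instead is a separate choice this sketch does not cover.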


Edited 13 Oct 2021:

Adding the full solution below, as requested.

First, convert the data into a dictionary of NumPy arrays:
wine_features_dict = {name: np.array(value) for name, value in wine_train.items()}

import itertools

def slices(features):
  for i in itertools.count():
    # For each feature take index `i`
    example = {name:values[i] for name, values in features.items()}
    yield example

for example in slices(wine_features_dict):
  for name, value in example.items():
    print(f"{name:19s}: {value}")
  break

Then create features_ds:

features_ds = tf.data.Dataset.from_tensor_slices(wine_features_dict)

for example in features_ds:
  for name, value in example.items():
    print(f"{name:19s}: {value}")
  break

which yields

country            : b'Portugal'
description        : b"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's  already drinkable, although it will certainly be better from 2016."
points             : 87.0
price              : 15.0
province           : b'Douro'
title              : b'Quinta dos Avidagos 2011 Avidagos Red (Douro)'
variety            : b'Portuguese Red'
winery             : b'Quinta dos Avidagos'
designation_replace: b'Avidagos'
user_id            : b'15'
price_isna         : False
price_imputed      : 15.0
wine_id            : 1

Preprocessing and other stuff

Between this and the actual model definition there are a number of steps too lengthy for a Stack Overflow post. Essentially, we create embeddings for categorical variables and normalize or discretize continuous features. I also added timestamps, which needed preprocessing as well. Text gets word embeddings, and then everything is combined:

wine titles, user_ids -> vocabularies -> embeddings

price, price imputed -> normalize or discretize (bucket) -> embed buckets

words -> tokenization (splitting into constituent words or word-pieces), followed by vocabulary learning, followed by an embedding.

timestamps -> bucket based on max and min -> embed buckets
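
The bucketing and normalization steps for continuous features can be sketched with NumPy alone. This is only an illustration of the arithmetic under stated assumptions: the timestamp values and the choice of 10 boundaries are made up, and np.digitize is used here as a stand-in for the Keras Discretization layer rather than the layer itself.

```python
import numpy as np

# Hypothetical timestamps (any continuous feature, such as price, works the same way).
timestamps = np.array([100.0, 250.0, 430.0, 975.0, 1000.0])

# Bucket boundaries spread evenly between the observed min and max.
boundaries = np.linspace(timestamps.min(), timestamps.max(), num=10)

# Map each value to the index of the bucket it falls into.
# With N boundaries the indices range from 0 to N, i.e. N + 1 possible ids.
bucket_ids = np.digitize(timestamps, boundaries)

# Standard normalization: zero mean, unit variance.
normalized = (timestamps - timestamps.mean()) / timestamps.std()

print("bucket ids:", bucket_ids)
print("normalized:", np.round(normalized, 3))
```

Because N boundaries produce N + 1 bucket ids, an embedding table placed after the bucketing step needs at least len(boundaries) + 1 rows.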

Next I explicitly defined the training inputs:

inputs = {}

for name, column in wine_train.items():
  dtype = column.dtype
  if dtype == object:
    dtype = tf.string
  else:
    dtype = tf.float32

  inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

and then the model:

class UserModel(tf.keras.Model):
  
  def __init__(self, use_timestamps=False, use_country_origin=False):
    super().__init__()

    self._use_timestamps = use_timestamps
    self._use_country_origin = use_country_origin

    self.user_embedding = tf.keras.Sequential([
        tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
    ])   

    '''
    # Can also do this if user_id_lookup is defined 
    self.user_embedding = tf.keras.Sequential([
        user_id_lookup,
        tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32),
    ])
    '''
    if use_country_origin:
      self.country_embedding = tf.keras.Sequential([
          tf.keras.layers.experimental.preprocessing.StringLookup(
              vocabulary=unique_countries, mask_token=None),
          tf.keras.layers.Embedding(len(unique_countries) + 1, 32),                                                   
      ])      
    
    if use_timestamps:
      self.timestamp_embedding = tf.keras.Sequential([
        tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
        tf.keras.layers.Embedding(len(timestamp_buckets) + 2, 32)
      ])
      self.normalized_timestamp = tf.keras.layers.experimental.preprocessing.Normalization()

      self.normalized_timestamp.adapt(timestamps)
    
    
  def call(self, inputs):

    # If timestamps not active just do the user_id embedding

    if not self._use_timestamps:
      
      # Ignore country of origin if is not enabled 
      if not self._use_country_origin:
        return self.user_embedding(inputs['user_id'])
      
      return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.country_embedding(inputs["country"]),                        
      ], axis=1)

    # Take the input dictionary, pass it through each input layer,
    # and concatenate the result.
    if not self._use_country_origin:
      return tf.concat([
          self.user_embedding(inputs["user_id"]),
          self.timestamp_embedding(inputs["timestamp"]),
          self.normalized_timestamp(inputs["timestamp"]),           
      ], axis=1)

    return tf.concat([        
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        self.normalized_timestamp(inputs["timestamp"]),
        self.country_embedding(inputs["country"]),
    ], axis=1)
    

We can then call it:

user_model = UserModel()

# Uncomment if timestamps are available:
# user_model.normalized_timestamp.adapt(
#     ratings.map(lambda x: x["timestamp"]).batch(128))
for row in features_ds.batch(1).take(1):
  print(f"Computed representations: {user_model(row)[0, :3]}")

Upvotes: 0

Narayan Kothari

Reputation: 21

Try using

wine_title_model.predict(["Susana Balbo Signature Malbec"])
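
For background on why the input type matters: the last frame of the traceback in the question runs input_t._keras_mask = mask on each model input. A plain Python str cannot take new attributes, so a list of raw strings fails at exactly that statement. The sketch below reproduces the mechanism in pure Python, without TensorFlow. FakeTensor and attach_mask are hypothetical names for this illustration, not TensorFlow APIs.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor: an ordinary object that accepts new attributes."""

    def __init__(self, values):
        self.values = values


def attach_mask(input_t, mask=None):
    # Mirrors the failing statement shown in the traceback.
    input_t._keras_mask = mask
    return input_t


# A raw Python string cannot receive a new attribute, so this raises AttributeError.
try:
    attach_mask("Susana Balbo Signature Malbec")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# An object that wraps the same data accepts the attribute without complaint.
wrapped = attach_mask(FakeTensor(["Susana Balbo Signature Malbec"]))
print(wrapped._keras_mask)
```

By analogy, handing the Sequential model a tensor or array instead of a Python list, for example tf.constant(["Susana Balbo Signature Malbec"]), should avoid the error as well, though that is an assumption not checked here against every TensorFlow version.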

Upvotes: 2
