Reputation: 13
I'm trying to load my pandas dataframe (df) into a TensorFlow dataset with the following commands:
target = df['label']
features = df['encoded_sentence']
dataset = tf.data.Dataset.from_tensor_slices((features.values, target.values))
Here's an excerpt from my pandas dataframe:
+-------+-----------------------+------------------+
| label | sentence              | encoded_sentence |
+-------+-----------------------+------------------+
| 0     | Hello world           | [5, 7]           |
+-------+-----------------------+------------------+
| 1     | my name is john smith | [1, 9, 10, 2, 6] |
+-------+-----------------------+------------------+
| 1     | Hello! My name is     | [5, 3, 9, 10]    |
+-------+-----------------------+------------------+
| 0     | foo baar              | [8, 4]           |
+-------+-----------------------+------------------+
# df.dtypes gives me:
label int8
sentence object
encoded_sentence object
But it keeps giving me a ValueError:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
Can anyone tell me how to use the encoded sentences in my TensorFlow dataset? Help would be greatly appreciated!
Upvotes: 1
Views: 1286
Reputation: 59701
You can make your Pandas values into a ragged tensor first and then make the dataset from it:
import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'sentence': ['Hello world', 'my name is john smith',
                                'Hello! My name is', 'foo baar'],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})
# Stack the variable-length lists into a single ragged tensor.
features = tf.ragged.stack(list(df['encoded_sentence']))
target = tf.convert_to_tensor(df['label'].values)
# Each dataset element is one (encoded sentence, label) pair.
dataset = tf.data.Dataset.from_tensor_slices((features, target))
for f, t in dataset:
    print(f.numpy(), t.numpy())
Output:
[5 7] 0
[ 1  9 10  2  6] 1
[ 5  3  9 10] 1
[8 4] 0
Note that you may want to use padded_batch to get batches of examples from the dataset.
EDIT: Since padded-batching does not seem to work with a dataset made from a ragged tensor at the moment, you can also convert the ragged tensor to a regular one first:
import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'sentence': ['Hello world', 'my name is john smith',
                                'Hello! My name is', 'foo baar'],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})
features_ragged = tf.ragged.stack(list(df['encoded_sentence']))
# Pad every row with -1 up to the length of the longest sentence.
features = features_ragged.to_tensor(default_value=-1)
target = tf.convert_to_tensor(df['label'].values)
dataset = tf.data.Dataset.from_tensor_slices((features, target))
# All rows now have the same length, so plain batching works.
batches = dataset.batch(2)
for f, t in batches:
    print(f.numpy(), t.numpy())
Output:
[[ 5  7 -1 -1 -1]
 [ 1  9 10  2  6]] [0 1]
[[ 5  3  9 10 -1]
 [ 8  4 -1 -1 -1]] [1 0]
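If you would rather pad per batch than pad the whole dataset up front, one possible workaround is to build the dataset from a generator, because padded_batch works on variable-length dense elements. This is only a sketch under some assumptions: the generator `gen`, the plain `encoded` and `labels` lists, and the padding values -1 and 0 are illustrative names and choices, and the `output_signature` argument needs a reasonably recent TensorFlow 2.x release.

```python
import tensorflow as tf

# Same encoded sentences and labels as in the dataframe above
# (assumed to have been pulled out into plain Python lists).
encoded = [[5, 7], [1, 9, 10, 2, 6], [5, 3, 9, 10], [8, 4]]
labels = [0, 1, 1, 0]


def gen():
    # Hypothetical helper: yields one (variable-length sequence, label) pair at a time.
    for seq, label in zip(encoded, labels):
        yield seq, label


dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # sequence of unknown length
        tf.TensorSpec(shape=(), dtype=tf.int32),       # scalar label
    ),
)

# Pad each batch only up to the longest sequence in that batch.
# -1 pads the sequences; the label padding value is required by the
# structure but never used because labels are scalars.
batches = dataset.padded_batch(
    2,
    padding_values=(tf.constant(-1, dtype=tf.int32),
                    tf.constant(0, dtype=tf.int32)),
)

for f, t in batches:
    print(f.numpy(), t.numpy())
```

Unlike the to_tensor version, the second batch here should be padded only to length 4 (the longest sentence in that batch) rather than to 5.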
Upvotes: 1