Bowen Peng

Reputation: 1815

PyTorch: convert a pd.DataFrame of variable-length sequences to a tensor

I have a pandas DataFrame as follows and want to convert it to a torch.Tensor for embedding.

# print the first 5 rows
print(df.head(5))

                      col
0             [a, bc, cd]
1      [d, ed, fsd, g, h]
2  [i, hh, ihj, gfw, hah]
3                 [a, cb]
4                   [sad]



I tried:

train_tensor = torch.from_numpy(train)

But it raises an error:

TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.

It seems that from_numpy() doesn't support variable-length sequences.
So if I want to initialize a tensor from it, what is the proper way?
After getting the corresponding tensor, I will pad the variable-length sequences and feed them to an embedding layer.
Could anyone help me?
Thanks in advance.

Upvotes: 0

Views: 801

Answers (2)

mujjiga

Reputation: 16856

There are multiple steps involved here:

Words to IDs

  • Pretrained: If you are using pretrained embeddings like GloVe/word2vec, you will have to map each word to its ID in the vocabulary so that the embedding layer can load the pretrained vectors (a sketch of this case follows after the vocabulary code below).
  • In case you want to train your own embeddings, you will have to map each word to an ID and save the map for later use (during predictions). This mapping is normally called the vocabulary.
# Build our own vocabulary: map each word to an integer ID
def to_vocabulary_id(df):
  word2id = {}
  sentences = []
  for v in df['col'].values:
    row = []
    for w in v:
      if w not in word2id:
        # IDs start at 1 so that 0 stays free for padding
        word2id[w] = len(word2id) + 1
      row.append(word2id[w])
    sentences.append(row)
  return sentences, word2id


import pandas as pd

df = pd.DataFrame({'col': [
                           ['a', 'bc', 'cd'],
                           ['d', 'ed', 'fsd', 'g', 'h'],
                           ['i', 'hh', 'ihj', 'gfw', 'hah'],
                           ['a', 'cb'],
                           ['sad']]})
sentences, word2id = to_vocabulary_id(df)
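
For the pretrained case from the first bullet, here is a minimal sketch. It assumes a dict pretrained_vectors mapping each word to a numpy vector (a hypothetical name, e.g. parsed from a GloVe text file); it is not defined anywhere above.

import numpy as np
import torch
import torch.nn as nn

def build_pretrained_embedding(word2id, pretrained_vectors, dim):
    # pretrained_vectors is an assumed word -> np.ndarray dict
    # Row 0 stays all-zero, matching the padding ID
    weights = np.zeros((len(word2id) + 1, dim), dtype=np.float32)
    for word, idx in word2id.items():
        if word in pretrained_vectors:
            weights[idx] = pretrained_vectors[word]
    # freeze=True keeps the pretrained vectors fixed during training
    return nn.Embedding.from_pretrained(torch.from_numpy(weights),
                                        freeze=True, padding_idx=0)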

Embedding layer

If our vocabulary size is, say, 100 and the embedding size is 8, then we will create an embedding layer as below:

embedding = nn.Embedding(100, 8)

Pad variable-length sentences with 0 and create the tensor

data = pad_sequence([torch.LongTensor(s) for s in sentences], batch_first=True, padding_value=0)

Run through the embedding layer

Finally

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

data = pad_sequence([torch.LongTensor(s) for s in sentences], batch_first=True, padding_value=0)

embedding = nn.Embedding(100, 8)
embedding(data).shape

Output:

torch.Size([5, 5, 8])

As you can see, we passed 5 sentences and the max length is 5, so we get embeddings of size 5 x 5 x 8, i.e. 5 sentences of 5 words each, with each word having an embedding of size 8.
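
One optional refinement (my addition, not part of the original answer): since ID 0 is reserved for padding, you can pass padding_idx=0 so the padding row stays at zero and receives no gradient updates:

embedding = nn.Embedding(100, 8, padding_idx=0)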

Upvotes: 1

Victor Zuanazzi

Reputation: 1964

There are a number of issues with what you want to do:

  • Torch tensors (as described in the error) do not store strings, only numbers.
  • Torch tensors are mathematical tensors (multi-dimensional matrices), which means they have a well-defined shape (you cannot store rows of different lengths).
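
A quick sketch illustrating both points (exact error messages vary by PyTorch version):

import torch

# Strings are rejected: tensors only hold numeric/bool dtypes
# torch.tensor(['a', 'bc'])           # raises an error
# Ragged rows are rejected: a tensor needs one rectangular shape
# torch.tensor([[1, 2, 3], [4, 5]])   # raises an error

# A rectangular list of numbers works fine
torch.tensor([[1, 2, 0], [3, 4, 5]])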

I would recommend taking a look at how to train NLP (Natural Language Processing) models in one of these tutorials: https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html They cover the theory and practice of word2vec techniques and how to use them for different machine learning tasks.

I hope that helps =)

Upvotes: 1
