Bowen Peng

Reputation: 1815

PyTorch: convert a pd.DataFrame of variable-length sequences to a tensor

I have a pandas DataFrame as follows and want to convert it to a torch.Tensor for embedding.

# print the first 5 rows
print(df.head(5))

                      col
0             [a, bc, cd]
1      [d, ed, fsd, g, h]
2  [i, hh, ihj, gfw, hah]
3                 [a, cb]
4                   [sad]



I tried:

train_tensor = torch.from_numpy(train)

But it raises an error:

TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.

It seems that from_numpy() doesn't support variable-length sequences.
So if I want to initialize a tensor from it, what is the proper way?
After getting the corresponding tensor, I will pad the variable-length sequences and feed them to an embedding layer.
Could anyone help me?
Thanks in advance.

Upvotes: 0

Views: 801

Answers (2)

mujjiga

Reputation: 16856

There are multiple steps involved here:

Words to IDs

  • Pretrained: If you are using pretrained embeddings like GloVe/word2vec, you will have to map each word to its ID in the vocabulary so that the embedding layer can load the pretrained vectors (a sketch of this case follows after the vocabulary code below).
  • In case you want to train your own embeddings, you will have to map each word to an ID and save the map for later use (during predictions). This mapping is normally called the vocabulary.
# Build our own vocabulary: map each word to an integer ID
def to_vocabulary_id(df):
  word2id = {}
  sentences = []
  for v in df['col'].values:
    row = []
    for w in v:
      if w not in word2id:
        # IDs start at 1 so that 0 stays free for padding
        word2id[w] = len(word2id) + 1
      row.append(word2id[w])
    sentences.append(row)
  return sentences, word2id


import pandas as pd

df = pd.DataFrame({'col': [
                           ['a', 'bc', 'cd'],
                           ['d', 'ed', 'fsd', 'g', 'h'],
                           ['i', 'hh', 'ihj', 'gfw', 'hah'],
                           ['a', 'cb'],
                           ['sad']]})
sentences, word2id = to_vocabulary_id(df)
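
For the pretrained case from the first bullet, here is a minimal sketch. It assumes a dict pretrained_vectors mapping each word to a numpy vector (a hypothetical name, e.g. parsed from a GloVe text file); it is not defined anywhere above.

import numpy as np
import torch
import torch.nn as nn

def build_pretrained_embedding(word2id, pretrained_vectors, dim):
    # pretrained_vectors is an assumed word -> np.ndarray dict
    # Row 0 stays all-zero, matching the padding ID
    weights = np.zeros((len(word2id) + 1, dim), dtype=np.float32)
    for word, idx in word2id.items():
        if word in pretrained_vectors:
            weights[idx] = pretrained_vectors[word]
    # freeze=True keeps the pretrained vectors fixed during training
    return nn.Embedding.from_pretrained(torch.from_numpy(weights),
                                        freeze=True, padding_idx=0)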

Embedding layer

If our vocabulary size is, say, 100 and the embedding size is 8, then we will create an embedding layer as below:

embedding = nn.Embedding(100, 8)

Pad variable-length sentences with 0 and create the tensor

data = pad_sequence([torch.LongTensor(s) for s in sentences], batch_first=True, padding_value=0)

Run through the embedding layer

Finally

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

data = pad_sequence([torch.LongTensor(s) for s in sentences], batch_first=True, padding_value=0)

embedding = nn.Embedding(100, 8)
embedding(data).shape

Output:

torch.Size([5, 5, 8])

As you can see, we passed 5 sentences and the max length is 5, so we get embeddings of size 5 x 5 x 8, i.e. 5 sentences of 5 words each, with each word having an embedding of size 8.
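
One optional refinement (my addition, not part of the original answer): since ID 0 is reserved for padding, you can pass padding_idx=0 so the padding row stays at zero and receives no gradient updates:

embedding = nn.Embedding(100, 8, padding_idx=0)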

Upvotes: 1

Victor Zuanazzi

Reputation: 1964

There are a number of issues with what you want to do:

  • Torch tensors (as described in the error) do not store strings, only numbers.
  • Torch tensors are mathematical tensors (multi-dimensional matrices), which means they have a well-defined shape (you cannot store rows of different lengths).
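
A quick sketch illustrating both points (exact error messages vary by PyTorch version):

import torch

# Strings are rejected: tensors only hold numeric/bool dtypes
# torch.tensor(['a', 'bc'])           # raises an error
# Ragged rows are rejected: a tensor needs one rectangular shape
# torch.tensor([[1, 2, 3], [4, 5]])   # raises an error

# A rectangular list of numbers works fine
torch.tensor([[1, 2, 0], [3, 4, 5]])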

I would recommend taking a look at how to train NLP (Natural Language Processing) models in one of these tutorials: https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html They cover the theory and practice of word2vec techniques and how to use them for different machine learning tasks.

I hope that helps =)

Upvotes: 1
