Reputation: 1815
I have a pandas DataFrame as follows and want to convert it to a torch.tensor
for embedding.
# output of the first 5 rows as an example
print(df['col'].head(5))
col
0 [a, bc, cd]
1 [d, ed, fsd, g, h]
2 [i, hh, ihj, gfw, hah]
3 [a, cb]
4 [sad]
train_tensor = torch.from_numpy(train)
But I get the following error:
TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.
It seems that from_numpy()
doesn't support variable-length sequences.
So if I want to initialize a tensor
from it, what is the proper way?
After getting the corresponding tensor,
I will try to pad the variable-length sequences and feed them into an embedding layer.
Could anyone help me?
Thanks in advance.
Upvotes: 0
Views: 801
Reputation: 16856
There are multiple steps involved here:
import pandas as pd

# Map each word in the vocabulary to our own integer ID
# (IDs start at 1 so that 0 stays free for padding later)
def to_vocabulary_id(df):
    word2id = {}
    sentences = []
    for v in df['col'].values:
        row = []
        for w in v:
            if w not in word2id:
                word2id[w] = len(word2id) + 1
            row.append(word2id[w])
        sentences.append(row)
    return sentences, word2id

df = pd.DataFrame({'col': [
    ['a', 'bc', 'cd'],
    ['d', 'ed', 'fsd', 'g', 'h'],
    ['i', 'hh', 'ihj', 'gfw', 'hah'],
    ['a', 'cb'],
    ['sad']]})

sentences, word2id = to_vocabulary_id(df)
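For the example DataFrame above, this assigns IDs in first-seen order, starting at 1:
print(sentences)
# [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12, 13], [1, 14], [15]]
print(len(word2id))  # 15 distinct words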
If our vocabulary size is, say, 100 and the embedding size is 8, then we will create an embedding layer as below:
embedding = nn.Embedding(100, 8)
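As a quick sanity check, nn.Embedding(100, 8) is just a trainable lookup table of 100 vectors of size 8, so every word ID produced above must be smaller than 100:
print(embedding.weight.shape)  # torch.Size([100, 8])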
Finally
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# pad the variable-length sentences with 0 so they all reach the same length
data = pad_sequence([torch.LongTensor(s) for s in sentences], batch_first=True, padding_value=0)
embedding = nn.Embedding(100, 8)
embedding(data).shape
Output:
torch.Size([5, 5, 8])
As you can see, we passed 5 sentences and the maximum length is 5, so we get embeddings of size 5 x 5 x 8,
i.e. 5 sentences, each with 5 (padded) words, each word having an embedding of size 8.
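One possible refinement: since padding_value=0 is used and the word IDs start at 1, the embedding layer can be told to treat index 0 as padding, so its vector stays at zeros and receives no gradient updates:
embedding = nn.Embedding(100, 8, padding_idx=0)
embedding(data).shape  # still torch.Size([5, 5, 8])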
Upvotes: 1
Reputation: 1964
There are a number of issues with what you are trying to do.
I would recommend taking a look at how to train NLP (Natural Language Processing) models in one of these tutorials: https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html They cover the theory and practice of word2vec techniques and how to use them for different machine learning tasks.
I hope that helps =)
Upvotes: 1