Aaditya Ura

Reputation: 12669

Deep learning: How to build character-level embeddings?

I am trying to use character-level embeddings in my model, but I have a few doubts about how they work.

So for word-level embedding:

Sentence = 'this is a example sentence'

create the vocab :

vocab = {'this': 0, 'is': 1, 'a': 2, 'example': 3, 'sentence': 4}

encode the sentence :

encoded_sentence = [ 0, 1 , 2 , 3 , 4 ]

now look each id up in a pre-trained embedding like word2vec or GloVe:

each id is replaced by a vector of the embedding dim (e.g. 300):

embedding_sentence = [ [ 0.331,0.11 , ----300th dim ] , [ 0.331,0.11 , ----300th dim ] , [ 0.331,0.11 , ----300th dim ] , [ 0.331,0.11 , ----300th dim ] , [ 0.331,0.11 , ----300th dim ] ] 

and if we are dealing with batches, then we pad the sentences to the same length.

So the shape goes like this :

[ batch_size , max_sentence_length , embedding_dim ]
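As a sanity check, the word-level pipeline is roughly this (just a sketch, with a trainable Embedding layer standing in for word2vec/GloVe):

    import tensorflow as tf
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    vocab = {'this': 0, 'is': 1, 'a': 2, 'example': 3, 'sentence': 4}
    encoded_sentences = [[0, 1, 2, 3, 4]]                       # batch with one sentence
    padded = pad_sequences(encoded_sentences, maxlen=5, padding='post')

    embedding = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=300)
    embedded = embedding(tf.constant(padded))
    print(embedded.shape)   # (1, 5, 300) -> [batch_size, max_sentence_length, embedding_dim]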

Now, for character-level embedding I have a few doubts:

So for char-level embedding:

Sentence = 'this is a example sentence'

create the char_vocab :

char_vocab = [' ', 'a', 'c', 'e', 'h', 'i', 'l', 'm', 'n', 'p', 's', 't', 'x']

char_to_int = {ch: i for i, ch in enumerate(char_vocab)}
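(For completeness, char_vocab above can be built from the sentence itself:)

    char_vocab = sorted(set('this is a example sentence'))
    # [' ', 'a', 'c', 'e', 'h', 'i', 'l', 'm', 'n', 'p', 's', 't', 'x']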

Encode the sentence at the char level:

Now here is my confusion: for word embeddings we first tokenise the sentence and then encode each token with its vocab id (word_id).

But for char embeddings, if I tokenise the sentence and then encode each token at the character level, the shape becomes 4-dim and I can't feed it to an LSTM.

But if I don't tokenise and just encode the raw text directly, it's 3-dim and I can feed it to an LSTM.
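To make the no-tokenization case concrete, this is roughly what I mean (just a sketch, reusing char_to_int from above):

    sentence = 'this is a example sentence'

    # encode the raw string character by character, spaces included
    encoded_raw = [char_to_int[ch] for ch in sentence]
    # per sentence: (num_chars,)      -> batch: (batch_size, max_num_chars)
    # after an embedding layer:          (batch_size, max_num_chars, embedding_dim)  -- 3-dim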

For example, with tokenization:

 token_sentence = ['this','is','a','example','sentence']

encoded_char_level = []

for word in token_sentence:
    char_ids = [char_to_int[ch] for ch in word]
    encoded_char_level.append(char_ids)

It looks like this:

[[11, 4, 5, 10],
 [5, 10],
 [1],
 [3, 12, 1, 7, 9, 6, 3],
 [10, 3, 8, 11, 3, 8, 2, 3]]

Now we have to pad this at two levels: first char-level padding, and second sentence-level padding:

char_level_padding:

[[11, 4, 5, 10, 0, 0, 0, 0],
 [5, 10, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0],
 [3, 12, 1, 7, 9, 6, 3, 0],
 [10, 3, 8, 11, 3, 8, 2, 3]]

Now, if we have 4 sentences, we also have to pad each sentence to the max sentence length, so the shape will be:

[batch_size , max_sentence_length , max_char_length ] 
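In code, the two-level padding I mean is roughly this (a sketch with Keras pad_sequences and NumPy; max_char_length and max_sentence_length are chosen by hand here):

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    max_char_length = 8
    max_sentence_length = 6

    # 1) char-level padding: every word becomes a fixed-length row of char ids
    char_padded = pad_sequences(encoded_char_level, maxlen=max_char_length, padding='post')

    # 2) sentence-level padding: append all-zero "words" until the sentence
    #    has max_sentence_length rows
    n_pad_words = max_sentence_length - len(char_padded)
    sentence_padded = np.vstack([char_padded,
                                 np.zeros((n_pad_words, max_char_length), dtype=int)])

    # stacking a batch of such matrices gives
    # [batch_size, max_sentence_length, max_char_length]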

Now if we pass this to an embedding layer, the shape becomes:

[ batch_size , max_sentence_length, max_char_length , embedding_dim ] 

which is 4-dim.

How do I encode sentences at the character level and use them with a TensorFlow LSTM layer?

Because an LSTM takes 3-dim input: [ batch_size , max_sequence_length , embedding_dim ]

Can I use something like:

[ Batch_size , ( max_sentence_length x max_char_length ) , dim ] 

so for example :

[ 12 , [ 3 x 4 ] , 300 ]
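In other words, something like this (just a sketch; the vocab size of 13 and the LSTM units are arbitrary):

    import tensorflow as tf

    batch_size, max_sentence_length, max_char_length, embedding_dim = 12, 3, 4, 300

    # placeholder char ids, only to show the shapes
    char_ids = tf.zeros([batch_size, max_sentence_length, max_char_length], dtype=tf.int32)

    embedding = tf.keras.layers.Embedding(input_dim=13, output_dim=embedding_dim)
    embedded = embedding(char_ids)                    # (12, 3, 4, 300) -- 4-dim

    # merge the word and char axes so the LSTM sees one long char sequence per sentence
    flat = tf.reshape(embedded,
                      [batch_size, max_sentence_length * max_char_length, embedding_dim])
    out = tf.keras.layers.LSTM(64)(flat)              # (12, 64)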

Upvotes: 3

Views: 4588

Answers (1)

Ashwin Geet D'Sa

Reputation: 7369

You can concatenate the character-level features of all words into a single fixed-length sequence per sentence.

For example:

[[11, 4, 5, 10, 0, 0, 0, 0],
 [5, 10, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0],
 [3, 12, 1, 7, 9, 6, 3, 0],
 [10, 3, 8, 11, 3, 8, 2, 3]]

can be changed to:

[[11, 4, 5, 10, 0, 0, 0, 0, 5, 10, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 12, 1, 7, 9, 6, 3, 0, 10, 3, 8, 11, 3, 8, 2, 3]]
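A rough Keras sketch of this (the vocab size of 13, the embedding dim and the LSTM units are arbitrary choices):

    import numpy as np
    import tensorflow as tf

    # one row of max_sentence_length * max_char_length char ids per sentence
    flat_ids = np.array([[11, 4, 5, 10, 0, 0, 0, 0,
                          5, 10, 0, 0, 0, 0, 0, 0,
                          1, 0, 0, 0, 0, 0, 0, 0,
                          3, 12, 1, 7, 9, 6, 3, 0,
                          10, 3, 8, 11, 3, 8, 2, 3]])              # shape (1, 40)

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=13, output_dim=300),   # -> (batch, 40, 300), 3-dim
        tf.keras.layers.LSTM(64),                                   # -> (batch, 64)
    ])
    out = model(tf.constant(flat_ids))

This way the LSTM still gets the [ batch_size , sequence_length , embedding_dim ] input it expects, with the word boundaries implicit in the fixed per-word offsets.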

Upvotes: 2
