user6903745
user6903745

Reputation: 5527

How to one-hot-encode sentences at the character level?

I would like to convert a sentence to an array of one-hot vector. These vector would be the one-hot representation of the alphabet. It would look like the following:

"hello" # h=7, e=4 l=11 o=14

would become

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Unfortunately OneHotEncoder from sklearn does not take as input string.

Upvotes: 2

Views: 9249

Answers (5)

kmario23
kmario23

Reputation: 61475

This is a common task in Recurrent Neural Networks and there's a specific function just for this purpose in tensorflow, if you'd like to use it.

alphabets = {'a' : 0, 'b': 1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}

idxs = [alphabets[ch] for ch in 'hello']
print(idxs)
# [7, 4, 11, 11, 14]

# @divakar's approach
idxs = np.fromstring("hello",dtype=np.uint8)-97

# or for more clear understanding, use:
idxs = np.fromstring('hello', dtype=np.uint8) - ord('a')

one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
sess = tf.InteractiveSession()

In [15]: one_hot.eval()
Out[15]: 
array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

Upvotes: 8

Divakar
Divakar

Reputation: 221684

Here's a vectorized approach using NumPy broadcasting to give us a (N,26) shaped array -

ints = np.fromstring("hello",dtype=np.uint8)-97
out = (ints[:,None] == np.arange(26)).astype(int)

If you are looking for performance, I would suggest using an initialized array and then assign -

out = np.zeros((len(ints),26),dtype=int)
out[np.arange(len(ints)), ints] = 1

Sample run -

In [153]: ints = np.fromstring("hello",dtype=np.uint8)-97

In [154]: ints
Out[154]: array([ 7,  4, 11, 11, 14], dtype=uint8)

In [155]: out = (ints[:,None] == np.arange(26)).astype(int)

In [156]: print out
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]

Upvotes: 3

boot-scootin
boot-scootin

Reputation: 12515

Just compare the letters in your passed string to a given alphabet:

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    vector = [[0 if char != letter else 1 for char in alphabet] 
                  for letter in strng]
    return vector

Note that, with a custom alphabet (e.g. "defbcazk", the columns will be ordered as each element appears in the original list).

The output of string_vectorizer('hello'):

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Upvotes: 9

user2285236
user2285236

Reputation:

With pandas, you can use pd.get_dummies by passing a categorical Series:

import pandas as pd
import string
low = string.ascii_lowercase

pd.get_dummies(pd.Series(list(s)).astype('category', categories=list(low)))
Out: 
   a  b  c  d  e  f  g  h  i  j ...  q  r  s  t  u  v  w  x  y  z
0  0  0  0  0  0  0  0  1  0  0 ...  0  0  0  0  0  0  0  0  0  0
1  0  0  0  0  1  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
2  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
4  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0

[5 rows x 26 columns]

Upvotes: 3

MassPikeMike
MassPikeMike

Reputation: 682

You asked about "sentences" but your example provided only a single word, so I'm not sure what you wanted to do about spaces. But as far as single words are concerned, your example could be implemented with:

def onehot(ltr):
 return [1 if i==ord(ltr) else 0 for i in range(97,123)]

def onehotvec(s):
 return [onehot(c) for c in list(s.lower())]

onehotvec("hello")
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Upvotes: 1

Related Questions