batlike

Reputation: 698

Convert string of length n to a matrix of n x len(alphabet)

Suppose we are given a string "AABCD" of length n = 5, drawn from an alphabet {'A', 'B', 'C', 'D', 'E', 'F'} of size len(alphabet) = 6. What is a Pythonic way of converting this string to a 5 x 6 matrix?

i.e.

#INPUT:
string = "AABCD"
alphabet = {'A', 'B', 'C', 'D', 'E', 'F'}
#OUTPUT
output = 
        A B C D E F
char 1[ 1 0 0 0 0 0 ]
char 2[ 1 0 0 0 0 0 ]
char 3[ 0 1 0 0 0 0 ]
char 4[ 0 0 1 0 0 0 ]
char 5[ 0 0 0 1 0 0 ]

I scoured other answers but have yet to find a question that is similar. Suggestions greatly appreciated!

Upvotes: 1

Views: 146

Answers (6)

oppressionslayer

Reputation: 7204

Here's mine; it also works with strings and alphabets of other sizes, as shown:

df = pd.DataFrame(((pd.Series([*string])*len(alphabet)).str.split("", n=-1, expand=True).drop(columns=[0, len(alphabet)+1]).eq(list(sorted(alphabet)))*1)).rename(index=lambda x: f'Char {x+1}', columns=lambda x: f'{chr(x+64)}')

In [1661]: df
Out[1661]: 
        A  B  C  D  E  F
Char 1  1  0  0  0  0  0
Char 2  1  0  0  0  0  0
Char 3  0  1  0  0  0  0
Char 4  0  0  1  0  0  0
Char 5  0  0  0  1  0  0

Or, with a longer string and a larger alphabet:

string = 'AABCDEEF'
alphabet = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'}

df = pd.DataFrame(((pd.Series([*string])*len(alphabet)).str.split("", n=-1, expand=True).drop(columns=[0, len(alphabet)+1]).eq(list(sorted(alphabet)))*1)).rename(index=lambda x: f'Char {x+1}', columns=lambda x: f'{chr(x+64)}')

        A  B  C  D  E  F  G  H
Char 1  1  0  0  0  0  0  0  0
Char 2  1  0  0  0  0  0  0  0
Char 3  0  1  0  0  0  0  0  0
Char 4  0  0  1  0  0  0  0  0
Char 5  0  0  0  1  0  0  0  0
Char 6  0  0  0  0  1  0  0  0
Char 7  0  0  0  0  1  0  0  0
Char 8  0  0  0  0  0  1  0  0

Upvotes: 0

Scott Boston

Reputation: 153460

You can use pandas to do this in very few lines:

import pandas as pd
string1 = "AABCD"
df = pd.Series([*string1]).str.get_dummies()
df = df.rename(index=lambda x: f'Char {x+1}')
print(df)

Output as a pandas DataFrame:

        A  B  C  D
Char 1  1  0  0  0
Char 2  1  0  0  0
Char 3  0  1  0  0
Char 4  0  0  1  0
Char 5  0  0  0  1

Note: [*'string'] is a piece of syntactic sugar that unpacks a string into a list of its characters, giving ['s','t','r','i','n','g'].
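
One thing to watch: str.get_dummies only creates columns for characters that actually occur in the string, so the output has no E or F columns. Here is a minimal sketch of one way to get the full 5 x 6 matrix. The `alphabet` variable and the `reindex` step are my additions, not part of the code above, and I am assuming a sorted list is an acceptable column order (a set has no order of its own).

```python
import pandas as pd

string1 = "AABCD"
# Assumption: sort the set to get a stable, predictable column order.
alphabet = sorted({'A', 'B', 'C', 'D', 'E', 'F'})

# One indicator column per character that occurs in the string.
dummies = pd.Series(list(string1)).str.get_dummies()

# reindex adds a column for every missing letter and fills it with 0.
df = dummies.reindex(columns=alphabet, fill_value=0)
df = df.rename(index=lambda x: f'Char {x+1}')
print(df)
```

If the string could contain characters outside the alphabet, note that reindex would silently drop those columns, so you may want to validate the input first.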

Upvotes: 1

Sayandip Dutta

Reputation: 15872

For your exact output:

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

print(f'output = \n\t{" ".join(alphabet)}')
for ix,char in enumerate(string, start=1):
    x = [0]*len(alphabet)
    x[alphabet.index(char)] = 1
    print(f'char {ix} {x}'.replace(',',''))

Output:

output = 
        A B C D E F
char 1 [1 0 0 0 0 0]
char 2 [1 0 0 0 0 0]
char 3 [0 1 0 0 0 0]
char 4 [0 0 1 0 0 0]
char 5 [0 0 0 1 0 0]

Upvotes: 1

Tank

Reputation: 521

Another solution that is slightly neater and maybe more general:

import numpy as np

alphabet = ["A", "B", "C", "D", "E", "F"]

# Map each letter to its column index
alphabet_dict = {x: i for i, x in enumerate(alphabet)}

string = ["A", "A", "B", "C", "D"]

# One row per character, one column per letter of the alphabet
output = np.zeros((len(string), len(alphabet)))

for i, x in enumerate(string):
    output[i][alphabet_dict[x]] = 1

Hope this helps.
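
If you are using NumPy anyway, the explicit loop can be replaced by a broadcast comparison. This is only a sketch of that alternative, not the approach in the code above: comparing a column of characters against a row of letters yields a boolean n x len(alphabet) array, which is then cast to integers. The names `chars`, `letters` and `one_hot` are just illustrative.

```python
import numpy as np

string = "AABCD"
alphabet = ["A", "B", "C", "D", "E", "F"]

chars = np.array(list(string))   # shape (5,)
letters = np.array(alphabet)     # shape (6,)

# Broadcasting (5, 1) against (1, 6) compares every character with every letter.
one_hot = (chars[:, None] == letters[None, :]).astype(int)   # shape (5, 6)
print(one_hot)
```

A character that is not in the alphabet simply produces an all-zero row here, whereas the dictionary lookup would raise a KeyError, so pick whichever behaviour suits your data.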

Upvotes: 1

Benyamin Karimi

Reputation: 163

You can use this code:

string = "AABCD"
# use a list instead of a set, because a set is unordered and has no index()
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']
# resulting matrix, one row per character
mat = []
# length of the alphabet, i.e. the size of each one-hot vector
l = len(alphabet)
for i in string:
    indx = alphabet.index(i)
    sub = [0] * l
    sub[indx] = 1
    mat.append(sub)

Output:

[[1, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 0, 1, 0, 0]]

Upvotes: 1

Devesh Kumar Singh

Reputation: 20490

A simple double for loop will do:

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

matrix = [[0 for _ in range(len(alphabet))] for _ in range(len(string))]

for i, s in enumerate(string):
    for j, a in enumerate(alphabet):
        matrix[i][j] = 1 if s == a else 0

print(matrix)

The output will be:

[
[1, 0, 0, 0, 0, 0], 
[1, 0, 0, 0, 0, 0], 
[0, 1, 0, 0, 0, 0], 
[0, 0, 1, 0, 0, 0], 
[0, 0, 0, 1, 0, 0]
]

It can also be done via itertools.product, but it won't look as clean as the for loop.

import itertools

string = "AABCD"
alphabet = ['A', 'B', 'C', 'D', 'E', 'F']

matrix = [[0 for _ in range(len(alphabet))] for _ in range(len(string))]

# enumerate yields the same (index, item) pairs as zipping with a range
for (i, s), (j, a) in itertools.product(enumerate(string), enumerate(alphabet)):
    matrix[i][j] = 1 if s == a else 0

print(matrix)
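
For what it's worth, the same comparison also fits in a nested list comprehension, which avoids pre-filling the matrix with zeros. This is a sketch of an alternative rather than a change to the code above. It starts from the set in the question, and sorting it is my assumption for fixing the column order.

```python
string = "AABCD"
# Assumption: sort the set from the question to get a fixed column order.
alphabet = sorted({'A', 'B', 'C', 'D', 'E', 'F'})

# One row per character, one 0/1 entry per letter of the alphabet.
matrix = [[int(s == a) for a in alphabet] for s in string]

print(matrix)
```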

Upvotes: 1
