pietz

Reputation: 2533

Array of strings to dataframe with word columns

What's the easiest way to get from an array of strings like this:

arr = ['abc def ghi', 'def jkl xyz', 'abc xyz', 'jkl xyz']

to a DataFrame where each column is a single word and each row contains 0 or 1 depending on whether that word appears in the corresponding string? Something like this:

   abc  def  ghi  jkl  xyz
0    1    1    1    0    0
1    0    1    0    1    1
2    1    0    0    0    1
3    0    0    0    1    1

EDIT: Here is my approach, which seems to me like a lot of Python looping rather than using the built-in pandas functions:

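import numpy as np
import pandas as pd

# collect the distinct words across all strings, in sorted order
labels = ' '.join(arr).split()
labels = sorted(set(labels))

df = pd.DataFrame(np.zeros((len(arr), len(labels)), dtype=int), columns=labels)

for i in range(len(arr)):
    for col in labels:
        # test against whole words; `col in arr[i]` would also match substrings
        if col in arr[i].split():
            df.at[i, col] = 1  # DataFrame.set_value was removed in pandas 1.0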

Upvotes: 0

Views: 1209

Answers (1)

TLousky

Reputation: 303

EDITED - reduced to 3 essential lines:

import pandas as pd

arr = ['abc def ghi', 'def jkl xyz', 'abc xyz', 'jkl xyz']

words = sorted( set( ' '.join( arr ).split() ) )  # sorted, so the column order is deterministic
rows  = [ { w : int( w in e.split() ) for w in words } for e in arr ]  # whole-word test, not substring
df    = pd.DataFrame( rows )

print( df )

Result:

   abc  def  ghi  jkl  xyz
0    1    1    1    0    0
1    0    1    0    1    1
2    1    0    0    0    1
3    0    0    0    1    1
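
Since the question asks about built-in pandas functions, a possible alternative is the string accessor's get_dummies method, which splits each string on a separator and builds the indicator columns in one call. This is only a sketch for the sample data above; it assumes the words are separated by exactly one space, since sep is treated as a literal delimiter rather than "any whitespace".

```python
import pandas as pd

arr = ['abc def ghi', 'def jkl xyz', 'abc xyz', 'jkl xyz']

# Split every string on a single space and create one 0/1 column per distinct word.
# Assumption: words are separated by exactly one space character.
df = pd.Series(arr).str.get_dummies(sep=' ')

print(df)
```

If the strings may contain tabs or runs of spaces, normalising them first (for example with ' '.join(s.split()) on each element) should keep the single-space assumption valid.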

Upvotes: 3
