Reputation: 2533
What's the easiest way to get from an array of strings like this:
arr = ['abc def ghi', 'def jkl xyz', 'abc xyz', 'jkl xyz']
to a DataFrame where each column is a single word and each row contains 0 or 1 depending on whether the word appears in the string? Something like this:
   abc  def  ghi  jkl  xyz
0    1    1    1    0    0
1    0    1    0    1    1
2    1    0    0    0    1
3    0    0    0    1    1
EDIT: Here is my approach, which seems to me like a lot of Python looping that doesn't use the built-in pandas functions:
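import numpy as np
import pandas as pd

labels = (' ').join(arr)
labels = labels.split()
labels = list(set(labels))
labels = sorted(labels)
# start with an all-zero frame: one row per string, one column per word
df = pd.DataFrame(np.zeros((len(arr), len(labels)), dtype=int), columns=labels)
cols = list(df.columns.values)
for i in range(len(arr)):
    for col in cols:
        # test against whole words; a plain `col in arr[i]` also matches substrings
        if col in arr[i].split():
            df.at[i, col] = 1  # df.set_value() was removed in pandas 1.0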
Upvotes: 0
Views: 1209
Reputation: 303
EDITED - reduced to 3 essential lines:
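import pandas as pd
arr = ['abc def ghi', 'def jkl xyz', 'abc xyz', 'jkl xyz']
# sorted() fixes the column order; a bare set iterates in arbitrary order
words = sorted( set( ' '.join( arr ).split() ) )
# e.split() tests whole words; a plain `w in e` would also match substrings
rows = [ { w : int( w in e.split() ) for w in words } for e in arr ]
df = pd.DataFrame( rows )
print( df )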
Result:
   abc  def  ghi  jkl  xyz
0    1    1    1    0    0
1    0    1    0    1    1
2    1    0    0    0    1
3    0    0    0    1    1
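Since the question asks for built-in pandas functions, one alternative worth trying is Series.str.get_dummies, which splits each string on a separator and builds the 0/1 indicator columns directly. This is only a minimal sketch, assuming the words are separated by single spaces as in the example data; it was not part of the original answer.

```python
import pandas as pd

arr = ['abc def ghi', 'def jkl xyz', 'abc xyz', 'jkl xyz']

# Split each string on spaces and emit one 0/1 indicator column per distinct word.
# Assumption: a single space is the only separator between words.
df = pd.Series(arr).str.get_dummies(sep=' ')
print(df)
```

The columns should come out in sorted order, and matching is done on whole tokens, so a word like 'abc' would not be counted for a string that only contains 'abcd'.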
Upvotes: 3