Jonathan Koren

Reputation: 933

Create a boolean DataFrame showing the existence of each element in a dictionary of lists

I have a dictionary of lists, and I have constructed a DataFrame whose index is the dictionary keys and whose columns are the set of all values that appear in the lists. Each cell is 1 if that column's value exists in that row's list and 0 otherwise. What is the most efficient way to construct this? Below is how I do it now with nested for loops, but I am sure there is a more efficient way using either vectorization or concatenation.

import pandas as pd

data = {0:[1,2,3,4],1:[2,3,4],2:[3,4,5,6]}
cols = sorted(list(set([x for y in data.values() for x in y])))
df = pd.DataFrame(0,index=data.keys(),columns=cols)

for row in df.iterrows():
  for col in cols:
    if col in data[row[0]]:
      df.loc[row[0],col] = 1
    else:
      df.loc[row[0],col] = 0

print(df)

Output:

       1  2  3  4  5  6
    0  1  1  1  1  0  0
    1  0  1  1  1  0  0
    2  0  0  1  1  1  1

Upvotes: 1

Views: 30

Answers (1)

jezrael

Reputation: 863206

Use MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

df = pd.DataFrame(mlb.fit_transform(data.values()),
                  columns=mlb.classes_,
                  index=data.keys())
print(df)
   1  2  3  4  5  6
0  1  1  1  1  0  0
1  0  1  1  1  0  0
2  0  0  1  1  1  1

A pure pandas, but much slower, solution uses str.get_dummies:

df = pd.Series(data).astype(str).str.strip('[]').str.get_dummies(', ')
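
For anyone unsure what that one-liner does, here is a minimal step-by-step sketch, assuming the same `data` dictionary as in the question. Note that `str.get_dummies` produces string column labels; the final cast back to integers is an addition for illustration and is not part of the one-liner above.

```python
import pandas as pd

data = {0: [1, 2, 3, 4], 1: [2, 3, 4], 2: [3, 4, 5, 6]}

# Each list becomes its string representation, e.g. "[1, 2, 3, 4]"
s = pd.Series(data).astype(str)

# Drop the surrounding brackets, leaving "1, 2, 3, 4"
s = s.str.strip('[]')

# Split on ", " and build one 0/1 indicator column per distinct token
df = s.str.get_dummies(', ')

# Column labels are strings at this point; cast to int (illustrative extra step)
df.columns = df.columns.astype(int)
df = df.sort_index(axis=1)

print(df)
```

Because this approach round-trips every list through a string, it only suits values whose string form contains no `", "` or brackets; MultiLabelBinarizer has no such restriction.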

Upvotes: 3
