Create boolean dataframe showing existence of each element in a dictionary of lists

Question

I have a dictionary of lists, and I have constructed a dataframe whose index is the dictionary keys and whose columns are the set of all values that appear in the lists. Each cell indicates whether that column's value exists in the list stored under that row's key. What is the most efficient way to construct this? Below is how I do it now using for loops, but I am sure there is a more efficient way using either vectorization or concatenation.

import pandas as pd

data = {0:[1,2,3,4],1:[2,3,4],2:[3,4,5,6]}
cols = sorted(list(set([x for y in data.values() for x in y])))
df = pd.DataFrame(0,index=data.keys(),columns=cols)

for row in df.iterrows():
  for col in cols:
    if col in data[row[0]]:
      df.loc[row[0],col] = 1
    else:
      df.loc[row[0],col] = 0

print(df)

Output:

       1  2  3  4  5  6
    0  1  1  1  1  0  0
    1  0  1  1  1  0  0
    2  0  0  1  1  1  1

jezrael · Accepted Answer

Use MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

df = pd.DataFrame(mlb.fit_transform(data.values()),
                  columns=mlb.classes_,
                  index=data.keys())
print(df)

Output:

       1  2  3  4  5  6
    0  1  1  1  1  0  0
    1  0  1  1  1  0  0
    2  0  0  1  1  1  1

A pure pandas solution with str.get_dummies, though much slower:

df = pd.Series(data).astype(str).str.strip('[]').str.get_dummies(', ')
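
Note that str.get_dummies works on the string form of each list, so the resulting column labels are strings ('1', '2', ...) rather than integers. As a further pure pandas variant, here is a minimal sketch that keeps integer labels by combining Series.explode with pd.get_dummies. It is not part of the original answer and has not been benchmarked here; it assumes pandas 0.25 or later (for explode), that every list is non-empty, and that all elements are integers. The name df_alt is only illustrative.

```python
import pandas as pd

data = {0: [1, 2, 3, 4], 1: [2, 3, 4], 2: [3, 4, 5, 6]}

# One row per (key, element) pair; the dictionary key is repeated in the index.
# Assumption: no list is empty, otherwise explode yields NaN and astype(int) fails.
exploded = pd.Series(data).explode().astype(int)

# One-hot encode each element, then collapse the rows that share a key.
# max() (rather than sum()) keeps the result 0/1 even if a list repeats a value.
df_alt = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

print(df_alt)
```

Using max in the groupby step is a deliberate choice: it records presence only, so a list such as [3, 3, 4] would still give 1 in column 3.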
