vishal sharma
vishal sharma

Reputation: 13

Convert python dictionary to dataframe with dict values(list) as columns and 1,0 if that column is in dict list

I want to create a dataframe from a dictionary which is of the format

Dictionary_ =  {'Key1': ['a', 'b', 'c', 'd'],'Key2': ['d', 'f'],'Key3': ['a', 'c', 'm', 'n']}

I am using

df = pd.DataFrame.from_dict(Dictionary_, orient ='index')

But it creates its own columns till max length of values and put values of dictionary as values in a dataframe.

I want a df with keys as rows and values as columns like

       a     b      c     d     e     f    m     n 
Key 1  1      1      1    1     0    0    0     0
Key 2  0      0      0    1     0    1    0     0
Key 3  1      0      1    0     0    0    1     1

I can do it by appending all values of dict and create an empty dataframe with dict keys as rows and values as columns and then iterating over each row to fetch values from dict and put 1 where it matches with column, but this will be too slow as my data has 200 000 rows and .loc is slow. I feel i can use pandas dummies somehow but don't know how to apply it here.

I feel there will be a smarter way to do this.

Upvotes: 1

Views: 1809

Answers (2)

U13-Forward
U13-Forward

Reputation: 71610

Use get_dummies:

>>> pd.get_dummies(df).rename(columns=lambda x: x[2:]).max(axis=1, level=0)
      a  d  b  c  f  m  n
Key1  1  1  1  1  0  0  0
Key2  0  1  0  0  1  0  0
Key3  1  0  0  1  0  1  1
>>> 

Upvotes: 1

jezrael
jezrael

Reputation: 863226

If performance is important, use MultiLabelBinarizer and pass keys and values:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(Dictionary_.values()),
                  columns=mlb.classes_, 
                  index=Dictionary_.keys()))
print (df)
      a  b  c  d  f  m  n
Key1  1  1  1  1  0  0  0
Key2  0  0  0  1  1  0  0
Key3  1  0  1  0  0  1  1

Alternative, but slowier is create Series, then str.join for strings and last call str.get_dummies:

df = pd.Series(Dictionary_).str.join('|').str.get_dummies()
print (df)
      a  b  c  d  f  m  n
Key1  1  1  1  1  0  0  0
Key2  0  0  0  1  1  0  0
Key3  1  0  1  0  0  1  1

Alternative with input DataFrame - use pandas.get_dummies, but then is necessary aggregate max per columns:

df1 = pd.DataFrame.from_dict(Dictionary_, orient ='index')

df = pd.get_dummies(df1, prefix='', prefix_sep='').max(axis=1, level=0)
print (df)
      a  d  b  c  f  m  n
Key1  1  1  1  1  0  0  0
Key2  0  1  0  0  1  0  0
Key3  1  0  0  1  0  1  1

Upvotes: 3

Related Questions