The Great

Reputation: 7703

Split a tuple within a dict and convert into dataframe

I have a dict of lists of tuples, as shown below

td = {966: [('Feat1', -0.04),
  ('Feat2=True ', -0.02),
  ('Feat3 <= 20000.00', 0.01),
  ('Feat4=Power Supply', -0.01),
  ('Feat5=dada', -0.0)],
 879: [('Feat8=Rare', 0.02),
  ('Feat11=HV', -0.01),
  ('Feat21=Power Supply', -0.01),
  ('20000.00 < Feat3 <= 50000.00', 0.01),
  ('Feat5=dada', -0.01)]}

I would like to do the following

a) Split each tuple within the dict into its two elements (the text before the comma and the number after it)

b) Store the numeric part in the value column of a dataframe and the text part in the feature column

c) Repeat the dict key for every tuple under it (and store it in the key column)

I tried the code below, but it is not elegant and I am worried it won't scale to millions of rows

import pandas as pd

feature = []
value = []
key = []
for k, pairs in td.items():
    for x in pairs:
        key.append(k)
        f, val = x  # unpack (feature, value) without shadowing the outer loop variable
        feature.append(f)
        value.append(val)
data_tuples = list(zip(key, feature, value))
pd.DataFrame(data_tuples, columns=['key', 'feature', 'value'])

I expect the output to be a dataframe with one row per tuple and the columns key, feature and value.

Upvotes: 1

Views: 245

Answers (2)

jezrael

Reputation: 862591

Use a generator expression to flatten the values and pass it to the DataFrame constructor:

df = pd.DataFrame(((k, b, c) for k, v in td.items() for b, c in v),
                  columns=['key', 'feature', 'value'])
print(df)
   key                       feature  value
0  966                         Feat1  -0.04
1  966                   Feat2=True   -0.02
2  966             Feat3 <= 20000.00   0.01
3  966            Feat4=Power Supply  -0.01
4  966                    Feat5=dada  -0.00
5  879                    Feat8=Rare   0.02
6  879                     Feat11=HV  -0.01
7  879           Feat21=Power Supply  -0.01
8  879  20000.00 < Feat3 <= 50000.00   0.01
9  879                    Feat5=dada  -0.01

Upvotes: 1

Serge Ballesta

Reputation: 148890

You can even use a generator expression for the data, so you don't have to build the intermediate lists yourself:

pd.DataFrame(([k, elt[0], elt[1]] for k, v in td.items() for elt in v),
             columns=['key', 'Feature', 'Value'])

   key                       Feature  Value
0  966                         Feat1  -0.04
1  966                   Feat2=True   -0.02
2  966             Feat3 <= 20000.00   0.01
3  966            Feat4=Power Supply  -0.01
4  966                    Feat5=dada  -0.00
5  879                    Feat8=Rare   0.02
6  879                     Feat11=HV  -0.01
7  879           Feat21=Power Supply  -0.01
8  879  20000.00 < Feat3 <= 50000.00   0.01
9  879                    Feat5=dada  -0.01
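
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same generator approach. The helper name flatten_feature_dict is hypothetical (it is not from the question or from pandas), the sample dict is a trimmed copy of the question's data, and the whitespace stripping at the end is an optional extra that I am assuming is wanted because 'Feat2=True ' carries a trailing space.

```python
import pandas as pd


def flatten_feature_dict(td):
    """Hypothetical helper: one row per (key, feature, value) triple."""
    # Generator expression: yields one flat tuple for every
    # (feature, value) pair stored under each dict key.
    rows = (
        (key, feature, value)
        for key, pairs in td.items()
        for feature, value in pairs
    )
    return pd.DataFrame(rows, columns=["key", "feature", "value"])


# Trimmed copy of the sample data from the question.
td = {
    966: [("Feat1", -0.04), ("Feat2=True ", -0.02)],
    879: [("Feat8=Rare", 0.02), ("Feat11=HV", -0.01)],
}

df = flatten_feature_dict(td)

# Optional (an assumption, not part of the question): drop stray
# whitespace around the feature text.
df["feature"] = df["feature"].str.strip()

print(df)
```

The key is repeated automatically because it is emitted once per inner tuple, which covers point c) of the question without a separate step.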

Upvotes: 2
