aabujamra
aabujamra

Reputation: 4636

Turning a dictionary inside of a dataframe into a new dataframe and selecting rows of it

I have a dataframe that looks like this:

    code   size type  type_tops
0   123    5.11 A .   [dictionary]
1   345    5.5  B .   [dictionary]
2   543    6.2  B .   [dictionary]

Those dictionaries look like this:

{'size': {6640: 6.2, 10481: 6.4, 6585: 6.1}, 'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0}, 'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'}, 'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}

I want to create a new column called "highest_guy" and another called "highest_guy_size". Each row has a different dictionary. Let's suppose that code 123 has this dictionary above. The first row should look like this:

    code   size type  type_tops       highest_guy    highest_guy_size
0   123    5.11 A .   [dictionary]    Mark           6.4

I suppose I would need to turn that dict into a dataframe and select the highest guy and his size. I could do that using a for loop, something like this:

for i in df.index:
    tops=pd.DataFrame(df['type_tops][i].sort_values(['size'].reset_index())
    df[highest_guy]=tops['name'][0]
    df['highest_guy_size']=tops['size][0]

But is there a way to make it without a for?

Upvotes: 1

Views: 80

Answers (2)

Allen Qin
Allen Qin

Reputation: 19957

Using Simon's example:

df = pd.DataFrame({'code': [123, 345, 543],
               'size': [5.11, 5.5, 6.2],
               'type': ['A', 'B', 'B'],
               'type_tops': [
                        {'size': {6640: 6.2, 10481: 6.4, 6585: 6.1},
                         'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
                         'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
                         'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}},
                        {'size': {6640: 6.2, 10481: 6.4, 6585: 6.9},
                         'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
                         'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
                         'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}},
                        {'size': {6640: 6.7, 10481: 6.4, 6585: 6.1},
                         'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
                         'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
                         'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}}
                       ]
               })

You can first find the highest guy and size:

df_high = (
    df.type_tops.apply(pd.DataFrame)
    .apply(lambda x: x.loc[x['size'].idxmax()])
    .reindex(['name','size'], axis=1)
    .set_axis(['highest_guy','highest_guy_size'], axis=1, inplace=False)
    )

Then join them with the original df:

pd.concat([df,df_high],axis=1)

Out[34]: 
   code  size type                                          type_tops  \
0   123  5.11    A  {'size': {6640: 6.2, 10481: 6.4, 6585: 6.1}, '...   
1   345  5.50    B  {'size': {6640: 6.2, 10481: 6.4, 6585: 6.9}, '...   
2   543  6.20    B  {'size': {6640: 6.7, 10481: 6.4, 6585: 6.1}, '...   

  highest_guy  highest_guy_size  
0        Mark               6.4  
1     Weasley               6.9  
2        John               6.7  

Upvotes: 2

Simon
Simon

Reputation: 10160

You can do this using a groupby and apply.

Theres no need to turn each dict into a separate dataframe - you can just get the id of the tallest individual and use that as a key to get values from the nested name and size dicts:

import pandas as pd

df = pd.DataFrame({'code': [123, 345, 543],
                   'size': [5.11, 5.5, 6.2],
                   'type': ['A', 'B', 'B'],
                   'type_tops': [
                            {'size': {6640: 6.2, 10481: 6.4, 6585: 6.1},
                             'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
                             'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
                             'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}},
                            {'size': {6640: 6.2, 10481: 6.4, 6585: 6.9},
                             'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
                             'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
                             'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}},
                            {'size': {6640: 6.7, 10481: 6.4, 6585: 6.1},
                             'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
                             'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
                             'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}}
                           ]
                   })


def extract_vals(df):
    cur_df = df

    # get the dictionary
    cur_dict = dict(cur_df['type_tops'].item())

    # get the key/id of the tallest individual
    highest_id = max(cur_dict['size'], key=lambda k: cur_dict['size'][k])

    # use id to get their name and size
    highest_guy_size = cur_dict['size'][highest_id]
    highest_guy = cur_dict['name'][highest_id]

    cur_df['highest_guy'] = highest_guy
    cur_df['highest_guy_size'] = highest_guy_size

    return cur_df


new = df.groupby('code').apply(lambda x: extract_vals(x))

That will give you an output dataframe that looks like this:

enter image description here

Upvotes: 2

Related Questions