Reputation: 4636
I have a dataframe that looks like this:
code size type type_tops
0 123 5.11 A . [dictionary]
1 345 5.5 B . [dictionary]
2 543 6.2 B . [dictionary]
Those dictionaries look like this:
{'size': {6640: 6.2, 10481: 6.4, 6585: 6.1}, 'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0}, 'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'}, 'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}
I want to create a new column called "highest_guy" and another called "highest_guy_size". Each row has a different dictionary. Let's suppose that code 123 has this dictionary above. The first row should look like this:
code size type type_tops highest_guy highest_guy_size
0 123 5.11 A . [dictionary] Mark 6.4
I suppose I would need to turn that dict into a dataframe and select the highest guy and his size. I could do that using a for loop, something like this:
for i in df.index:
tops=pd.DataFrame(df['type_tops][i].sort_values(['size'].reset_index())
df[highest_guy]=tops['name'][0]
df['highest_guy_size']=tops['size][0]
But is there a way to make it without a for?
Upvotes: 1
Views: 80
Reputation: 19957
Using Simon's example:
df = pd.DataFrame({'code': [123, 345, 543],
'size': [5.11, 5.5, 6.2],
'type': ['A', 'B', 'B'],
'type_tops': [
{'size': {6640: 6.2, 10481: 6.4, 6585: 6.1},
'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}},
{'size': {6640: 6.2, 10481: 6.4, 6585: 6.9},
'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}},
{'size': {6640: 6.7, 10481: 6.4, 6585: 6.1},
'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}}
]
})
You can first find the highest guy and size:
df_high = (
df.type_tops.apply(pd.DataFrame)
.apply(lambda x: x.loc[x['size'].idxmax()])
.reindex(['name','size'], axis=1)
.set_axis(['highest_guy','highest_guy_size'], axis=1, inplace=False)
)
Then join them with the original df:
pd.concat([df,df_high],axis=1)
Out[34]:
code size type type_tops \
0 123 5.11 A {'size': {6640: 6.2, 10481: 6.4, 6585: 6.1}, '...
1 345 5.50 B {'size': {6640: 6.2, 10481: 6.4, 6585: 6.9}, '...
2 543 6.20 B {'size': {6640: 6.7, 10481: 6.4, 6585: 6.1}, '...
highest_guy highest_guy_size
0 Mark 6.4
1 Weasley 6.9
2 John 6.7
Upvotes: 2
Reputation: 10160
You can do this using a groupby and apply.
Theres no need to turn each dict into a separate dataframe - you can just get the id of the tallest individual and use that as a key to get values from the nested name
and size
dicts:
import pandas as pd
df = pd.DataFrame({'code': [123, 345, 543],
'size': [5.11, 5.5, 6.2],
'type': ['A', 'B', 'B'],
'type_tops': [
{'size': {6640: 6.2, 10481: 6.4, 6585: 6.1},
'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}},
{'size': {6640: 6.2, 10481: 6.4, 6585: 6.9},
'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}},
{'size': {6640: 6.7, 10481: 6.4, 6585: 6.1},
'Speed': {6640: 119685000.0, 10481: 145793000.0, 6585: 200021000.0},
'type': {6640: '62.0', 10481: '62.0', 6585: '62.0'},
'name': {6640: 'John', 10481: 'Mark', 6585: 'Weasley'}}
]
})
def extract_vals(df):
cur_df = df
# get the dictionary
cur_dict = dict(cur_df['type_tops'].item())
# get the key/id of the tallest individual
highest_id = max(cur_dict['size'], key=lambda k: cur_dict['size'][k])
# use id to get their name and size
highest_guy_size = cur_dict['size'][highest_id]
highest_guy = cur_dict['name'][highest_id]
cur_df['highest_guy'] = highest_guy
cur_df['highest_guy_size'] = highest_guy_size
return cur_df
new = df.groupby('code').apply(lambda x: extract_vals(x))
That will give you an output dataframe that looks like this:
Upvotes: 2