A dictionary in a Pandas dataframe column in Python

Question

I am reading a CSV file in which one column contains a dict with multiple keys. Here is an example:

import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[{'AUS': {'arv': '10:00', 'vol': 5}, 'DAL': {'arv': '9:00', 'vol': 1}}, {'DAL': {'arv': '10:00', 'vol': 6}, 'NYU': {'arv': '10:00', 'vol': 3}}, {'DAL': {'arv': '8:00', 'vol': 6}, 'DAL': {'arv': '10:00', 'vol': 1}, 'GBD': {'arv': '12:00', 'vol': 1}}]})

I want to query column b of the dataframe above and return the matching values, as shown below. Is there a more intuitive and more efficient way to do this on a large dataset, without looping through the dict?

#convert column b of df to a dict
df_dict = df.b.to_dict()
print(df_dict)
{0: {'AUS': {'arv': '10:00', 'vol': 5}, 'DAL': {'arv': '9:00', 'vol': 1}}, 1: {'DAL': {'arv': '10:00', 'vol': 6}, 'NYU': {'arv': '10:00', 'vol': 3}}, 2: {'DAL': {'arv': '10:00', 'vol': 1}, 'GBD': {'arv': '12:00', 'vol': 1}}}

def get_value(my_str, my_time):
    total = 0
    for key in df_dict:
        if my_str in df_dict[key].keys():
            if df_dict[key].get(my_str).get('arv') == my_time:
                total = total + df_dict[key].get(my_str).get('vol')
    return total

print("total vol at 10:00 is:", get_value('DAL', '10:00'))
total vol at 10:00 is: 7

O.Laprevote · Accepted Answer

While dukkee's answer works, I find its organization a bit counterintuitive if you want to manipulate your dataframe in other ways. I would also reorganize the dataframe, but like this:

input_data = {
    'a':[1,2,3], 
    'b':[{'AUS': {'arv': '10:00', 'vol': 5},
         'DAL': {'arv': '9:00', 'vol': 1}
        },
        {'DAL': {'arv': '10:00', 'vol': 6},
         'NYU': {'arv': '10:00', 'vol': 3}
        },
        {'DAL': {'arv': '8:00', 'vol': 6},
         'DAL': {'arv': '10:00', 'vol': 1},
         'GBD': {'arv': '12:00', 'vol': 1}
        }]
}

data_list = [[input_data['a'][i], key, value['arv'], value['vol']]
            for i, dic in enumerate(input_data['b'])
            for key, value in dic.items()]
df = pd.DataFrame(data_list, columns=['a', 'abr', 'arv', 'vol'])

Which results in:

>>> df
   a  abr    arv  vol
0  1  AUS  10:00    5
1  1  DAL   9:00    1
2  2  DAL  10:00    6
3  2  NYU  10:00    3
4  3  DAL  10:00    1
5  3  GBD  12:00    1
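
Since the question mentions reading from a CSV, the same flattening can also start from an existing dataframe rather than from input_data. The following is only a sketch under an assumption the question does not confirm: that the cells of column b arrive as strings holding Python dict literals, in which case ast.literal_eval can parse them. The names raw, parsed and flat are made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical stand-in for the CSV contents: column 'b' holds dict *strings*.
raw = pd.DataFrame({
    'a': [1, 2],
    'b': ["{'AUS': {'arv': '10:00', 'vol': 5}, 'DAL': {'arv': '9:00', 'vol': 1}}",
          "{'DAL': {'arv': '10:00', 'vol': 6}, 'NYU': {'arv': '10:00', 'vol': 3}}"],
})

# Turn each string into a real dict (skip this step if the cells are already dicts).
parsed = raw['b'].apply(ast.literal_eval)

# One output row per (a, key) pair, same layout as the table above.
rows = [
    {'a': a, 'abr': abr, 'arv': info['arv'], 'vol': info['vol']}
    for a, dic in zip(raw['a'], parsed)
    for abr, info in dic.items()
]
flat = pd.DataFrame(rows, columns=['a', 'abr', 'arv', 'vol'])
print(flat)
```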

I believe that's the way you should organize your data: one row per key, with arv and vol as ordinary columns. Having dictionaries as values in a dataframe seems counterintuitive to me. With this layout you can use loc to solve your problem:

>>> df.loc[(df['arv']=='10:00') & (df['abr']=='DAL')]
   a  abr    arv  vol
2  2  DAL  10:00    6
4  3  DAL  10:00    1
>>> vol_sum = df.loc[(df['arv']=='10:00') & (df['abr']=='DAL'), 'vol'].sum()
>>> print(f"total vol at 10:00 is: {vol_sum}")
total vol at 10:00 is: 7
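
If you need this query more than once, the filter can be wrapped in a small helper, and groupby gives the totals for every (abr, arv) pair in one pass. This is a minimal sketch; total_vol, flat and totals are names I made up, and flat simply rebuilds the table shown above so the snippet runs on its own.

```python
import pandas as pd

# Same rows as the flattened table above, rebuilt so the snippet is self-contained.
flat = pd.DataFrame(
    [[1, 'AUS', '10:00', 5], [1, 'DAL', '9:00', 1], [2, 'DAL', '10:00', 6],
     [2, 'NYU', '10:00', 3], [3, 'DAL', '10:00', 1], [3, 'GBD', '12:00', 1]],
    columns=['a', 'abr', 'arv', 'vol'],
)

def total_vol(frame, abr, arv):
    """Sum 'vol' over the rows matching both abr and arv (0 if none match)."""
    mask = (frame['abr'] == abr) & (frame['arv'] == arv)
    return int(frame.loc[mask, 'vol'].sum())

# Totals for every (abr, arv) combination at once, indexed by both columns.
totals = flat.groupby(['abr', 'arv'])['vol'].sum()

print(total_vol(flat, 'DAL', '10:00'))   # 7
print(totals.loc[('DAL', '10:00')])      # 7
```

The groupby version is worth it when you want many such totals, because it avoids filtering the dataframe again for each query.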

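A small advantage over dukkee's answer: there is no need to use collections, and a list comprehension is usually somewhat faster than an equivalent for-loop that appends to a list. Note that one of your dictionaries has 'DAL' as a key twice. Python keeps only the last value for a repeated key, so the first 'DAL' entry ('8:00', vol 6) is silently dropped, which is why row 3 shows only the '10:00' entry.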
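
The duplicate-key behaviour is easy to check in isolation. The snippet below uses the third dictionary from the question; the variable name d is arbitrary.

```python
# A dict literal with a repeated key: the later value overwrites the earlier one.
d = {'DAL': {'arv': '8:00', 'vol': 6},
     'DAL': {'arv': '10:00', 'vol': 1},
     'GBD': {'arv': '12:00', 'vol': 1}}

print(len(d))     # 2
print(d['DAL'])   # {'arv': '10:00', 'vol': 1}
```

If both 'DAL' entries matter, they have to be stored in a structure that allows repeats, for example a list of records, before they ever reach the dataframe.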
