For a pandas dataframe with one column of keys and one column of values, make another column of dictionaries

Question

I have the following dataframe:

    c1  c2          freq
0   a   [u]         [4]
1   b   [x, z, v]   [8, 3, 15]

I want to get another column "dict" such that

    c1  c2          freq         dict
0   a   [u]         [4]          {'u':4}
1   b   [x, z, v]   [8, 3, 15]   {'x':8, 'z':3, 'v':15}

I'm trying this code:

    d["dict"] = d.apply(lambda row: dict(zip(row["c2"], row["freq"])))

but it raises this error:

    KeyError: ('c2', u'occurred at index c1')

Not sure what I'm doing wrong. The whole exercise is that I have a global dictionary defined like this: {"u":4, "v":15, "x":8, "z":3} and my initial dataframe is:

    c1  c2
0   a   u
1   b   [x, z, v]

where the [x, z, v] is a numpy array. For each row, I want to retain the top 2 elements (if it's an array) with the highest values from the global dictionary, so for the second row I'll retain x and v. To that end, I converted each element of c2 column into a list, created a new column with their respective frequencies and now want to convert into a dictionary so that I can sort it by values. Then I'll retain the top 2 keys of the dictionary of that row.

d["c2"] = d["c2"].apply(lambda x: list(set(x)))
d["freq"] = d["c2"].apply(lambda x: [c[j] for j in x])
d["dict"] = d.apply(lambda row: dict(zip(row["c2"], row["freq"])))

The third line is causing a problem. Also, if there's a more efficient procedure to do the whole thing, I'd be glad for any advice. Thanks!

Ivan Popov · Accepted Answer

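You can solve your core problem more easily with the key and reverse arguments of the sorted built-in, without building the intermediate frequency and dictionary columns. Simply prepare a partial function that sorts keys by their value in the global dictionary, then map it over the column, followed by a function that keeps the first two elements, in method-chaining style: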

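import pandas as pd
from functools import partial

# In row 0, 'c2' is a bare string; in row 1 it is a list of keys.
df = pd.DataFrame({'c1': ['a', 'b'], 'c2': ['u', ['x','z','v']]})

# Global lookup of key -> frequency.
c = {"u":4, "v":15, "x":8, "z":3}

# sorted() with key and reverse pre-filled: orders keys by descending frequency.
# It always returns a list, so the single-character string 'u' becomes ['u'].
sorter = partial(sorted, key=lambda x: c[x], reverse=True)

def subset(l):
    # Keep at most the first two elements.
    return l[:2]

df['highest_two'] = df['c2'].map(sorter).map(subset)

print(df)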

"""
Out:
      c1         c2 highest_two
    0  a          u         [u]
    1  b  [x, z, v]      [v, x]
"""
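
As for the KeyError in the question itself: DataFrame.apply defaults to axis=0, so the lambda receives each column rather than each row, and row["c2"] is then looked up in the index of column c1, where it does not exist. Passing axis=1 makes apply work row by row. Below is a minimal sketch of that fix, assuming c2 already holds lists as in the question's first table; the sample frame is reconstructed here for illustration only.

```python
import pandas as pd

# Global key -> frequency lookup from the question.
c = {"u": 4, "v": 15, "x": 8, "z": 3}

# Illustrative frame: every c2 entry is already a list of keys.
df = pd.DataFrame({"c1": ["a", "b"], "c2": [["u"], ["x", "z", "v"]]})

# Frequencies aligned with the keys in each row.
df["freq"] = df["c2"].apply(lambda keys: [c[k] for k in keys])

# axis=1 hands each row (not each column) to the lambda.
df["dict"] = df.apply(lambda row: dict(zip(row["c2"], row["freq"])), axis=1)

print(df["dict"].tolist())
```

If the dictionary column is only a stepping stone to the top-two keys, the sorted-based approach above avoids it entirely.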
