najeem

Reputation: 1921

Pandas: create a column by mapping from a dict of lists

I have a pandas dataframe and a dict, shown below. I would like to create a new column in the dataframe that records which list in the dict contains each row's index.

import pandas as pd

df = pd.DataFrame({
    'a':['x', 'y', 'z'],
    'b':[1, 2, 3],
    'c':[10, 20, 30],
})
df = df.set_index('a')
mydict = {
    'g1':['a', 'b', 'y'],
    'g2':['x', 'k', 'l'],
    'g3':['m', 'l', 'z'],
}

Expected output

   b   c   g
a       
x  1  10  g2
y  2  20  g1
z  3  30  g3

I don't even know how to begin solving this. Maybe create a custom function and map using that? But will that be performant enough? In my actual use case, there may be hundreds of thousands of rows in 'df', around a hundred keys in 'mydict', and each list will contain tens of thousands of items. Also, I will mostly be dealing with integers instead of the strings shown here.

Upvotes: 2

Views: 2659

Answers (2)

noah

Reputation: 2776

This loops through the index values and looks each one up in the lists of the dictionary. It runs ~7-10x faster** than ALollz's solution, even though it doesn't look quite as pretty.

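g = []
for i in df.index.values:
    # take the first key whose list contains this index value
    for k in mydict:
        if i in mydict[k]:
            g.append(k)
            break
    else:
        # no list contains i; append a placeholder so g stays aligned with df
        g.append(None)
df['g'] = g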

** Faster only for small inputs like the example in the question. As the data grows, ALollz's solution becomes faster: building the value:key dict has an up-front cost that dominates on small data, but every lookup afterwards is a single hash lookup, whereas the loop scans the lists again for every row.
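To see where the crossover falls on your own machine, a rough timing harness along these lines may help. It is only a sketch: the sizes, the names `groups`, `frame`, `by_loop` and `by_map`, and the choice of disjoint integer groups are all assumptions made for illustration, not part of the question's data.

```python
import random
import timeit

import pandas as pd

# Hypothetical synthetic data: 10 disjoint groups of 200 integers each,
# so both approaches must agree on every row.
n_groups, group_size, n_rows = 10, 200, 1000
groups = {
    f"g{k}": list(range(k * group_size, (k + 1) * group_size))
    for k in range(n_groups)
}
rng = random.Random(0)
index = rng.sample(range(n_groups * group_size), n_rows)
frame = pd.DataFrame({"b": range(n_rows)}, index=index)


def by_loop(frame, groups):
    """Nested loop: scan the lists again for every index value."""
    out = []
    for i in frame.index.values:
        for key, members in groups.items():
            if i in members:
                out.append(key)
                break
        else:
            out.append(None)  # index value found in no list
    return out


def by_map(frame, groups):
    """Flatten to value -> key once, then look every index value up."""
    flat = {val: key for key, members in groups.items() for val in members}
    return list(frame.index.map(flat))


loop_time = timeit.timeit(lambda: by_loop(frame, groups), number=3)
map_time = timeit.timeit(lambda: by_map(frame, groups), number=3)
print(f"loop: {loop_time:.4f}s  map: {map_time:.4f}s")
```

Raising `n_rows`, `n_groups` and `group_size` towards the sizes in the question should show how the two approaches scale; the absolute numbers will vary by machine and pandas version.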

Upvotes: 2

ALollz

Reputation: 59549

You need to flatten your dictionary of lists into a simple dictionary with scalar key: value pairs. Since dictionaries require unique keys, if multiple lists contain the same value, that value ultimately gets mapped to the key of the last list it belongs to (which is fine since you don't care which).

d = {val:key for key, lst in mydict.items() for val in lst}
#{'a': 'g1', 'b': 'g1', 'y': 'g1',
# 'x': 'g2', 'k': 'g2', 
# 'l': 'g3', 'm': 'g3', 'z': 'g3'}

df['g'] = df.index.map(d)
#   b   c   g
#a           
#x  1  10  g2
#y  2  20  g1
#z  3  30  g3
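If it does matter which list wins for a duplicated value, or some index labels may be missing from every list, the flattening step can be adjusted. The following is a minimal sketch under those assumptions; `map_index_to_group`, its `first_wins` flag and the integer sample data are hypothetical names introduced here, not part of pandas or of the question.

```python
import pandas as pd


def map_index_to_group(frame, groups, first_wins=True):
    """Return the group key for each index label of frame.

    Labels that appear in no list come back as NaN.
    """
    flat = {}
    for key, members in groups.items():
        for val in members:
            if first_wins:
                flat.setdefault(val, key)  # keep the earliest key seen
            else:
                flat[val] = key  # later keys overwrite earlier ones
    return frame.index.map(flat)


# Hypothetical integer data: 3 appears in two lists, 99 in none.
groups = {"g1": [1, 2, 3], "g2": [3, 4], "g3": [5]}
frame = pd.DataFrame({"b": [10, 20, 30, 40]}, index=[3, 4, 5, 99])

frame["g"] = map_index_to_group(frame, groups)
frame["g_last"] = map_index_to_group(frame, groups, first_wins=False)
print(frame)
```

With `first_wins=True` the result matches what a first-match loop over the dict would give; with `first_wins=False` it matches the dict comprehension above.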

Upvotes: 5
