Reputation: 1921
I have a pandas dataframe and a dict as shown below. I would like to create a new column in the dataframe based on which of the dict's lists the row's index appears in.
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'y', 'z'],
    'b': [1, 2, 3],
    'c': [10, 20, 30],
})
df = df.set_index('a')

mydict = {
    'g1': ['a', 'b', 'y'],
    'g2': ['x', 'k', 'l'],
    'g3': ['m', 'l', 'z'],
}
Expected output
b c g
a
x 1 10 g2
y 2 20 g1
z 3 30 g3
I don't even know how to begin solving this. Maybe create a custom function and map with it? But will that be performant enough? In my actual use case, there may be hundreds of thousands of rows in df, around a hundred keys in mydict, and each list will contain tens of thousands of items. Also, I will mostly be dealing with integers rather than the strings shown here.
Upvotes: 2
Views: 2659
Reputation: 2776
This loops through the index values and looks each one up in the values of the dictionary. It runs ~7-10x faster** than Alollz's solution, even though it doesn't look quite as pretty.
g = []
for i in df.index.values:
    for k in mydict:
        if i in mydict[k]:
            g.append(k)
            break
df['g'] = g
** Faster for small data sizes like the example in the question. As the amount of data increases, Alollz's solution becomes faster: building the value:key dict has a lot of up-front overhead, so the plain loop only wins on small inputs.
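Since the question mentions lists with tens of thousands of items, it may be worth noting that `in` on a list is an O(n) scan per lookup. A minimal sketch of the same loop with each list pre-converted to a set (the set conversion is my addition, not part of the timing comparison above), making each membership test O(1):

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'y', 'z'],
    'b': [1, 2, 3],
    'c': [10, 20, 30],
}).set_index('a')

mydict = {
    'g1': ['a', 'b', 'y'],
    'g2': ['x', 'k', 'l'],
    'g3': ['m', 'l', 'z'],
}

# Convert each list to a set once, up front, so lookups inside the
# loop are O(1) instead of scanning the whole list.
sets = {k: set(v) for k, v in mydict.items()}

g = []
for i in df.index.values:
    for k, s in sets.items():
        if i in s:
            g.append(k)   # first matching group wins
            break

df['g'] = g
```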
Upvotes: 2
Reputation: 59549
You need to flatten your dictionary of lists into a simple dictionary with scalar key: value pairs. Since dictionaries require unique keys, if multiple lists contain the same value, that value ultimately gets mapped to the key of the last list it belongs to (which is fine since you don't care which).
d = {val:key for key, lst in mydict.items() for val in lst}
#{'a': 'g1', 'b': 'g1', 'y': 'g1',
# 'x': 'g2', 'k': 'g2',
# 'l': 'g3', 'm': 'g3', 'z': 'g3'}
df['g'] = df.index.map(d)
# b c g
#a
#x 1 10 g2
#y 2 20 g1
#z 3 30 g3
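One caveat worth knowing: `Index.map` yields NaN for any index value that appears in none of the lists. A small sketch of that edge case (the extra 'q' row and the 'none' placeholder label are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'b': [1, 2], 'c': [10, 20]},
    index=pd.Index(['x', 'q'], name='a'),  # 'q' is in no list
)

mydict = {
    'g1': ['a', 'b', 'y'],
    'g2': ['x', 'k', 'l'],
    'g3': ['m', 'l', 'z'],
}

# Flatten to value: key, as in the answer above.
d = {val: key for key, lst in mydict.items() for val in lst}

df['g'] = df.index.map(d)            # 'q' maps to NaN
df['g'] = df['g'].fillna('none')     # assumed placeholder label
```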
Upvotes: 5