Reputation: 119
I have a pandas DataFrame called self.data.
It has two columns, name and value, where each cell holds a list,
and I want to generate a new column that holds a dictionary mapping each name to the list of its values.
For example:
Name | Value | New Dict Column |
---|---|---|
[a, b, c, a] | [1, 2, 3, 4] | {a: [1, 4], b: [2], c: [3]} |
[b, b, a] | [1, 2, 3] | {b: [1, 2], a: [3] } |
At this moment I have the following code:
self.data['dict'] = self.data[['name', 'value']].apply(lambda x: dict(zip(*x)), axis=1)
The problem with this attempt is that a repeated name overwrites its earlier value. When a name appears twice in a row (say with values a1 and a2), I can't keep both: the final dictionary only stores the last one.
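A minimal reproduction of the overwrite, using plain lists in place of one row of the DataFrame (the variable names here are only illustrative):

```python
names = ["a", "b", "c", "a"]
values = [1, 2, 3, 4]

# dict() keeps only the last value seen for a repeated key,
# so the first value for "a" (1) is replaced by the second (4).
result = dict(zip(names, values))
print(result)  # {'a': 4, 'b': 2, 'c': 3}
```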
Thank you in advance!
Upvotes: 1
Views: 1280
Reputation: 862521
Use a custom function with defaultdict if performance is important:
from collections import defaultdict

def f(x):
    d = defaultdict(list)
    for y, z in zip(*x):
        d[y].append(z)
    return d

df['New Dict Column'] = [f(x) for x in df[['column1', 'column2']].to_numpy()]
print(df)
column1 column2 New Dict Column
0 [a, b, c, a] [1, 2, 3, 4] {'a': [1, 4], 'b': [2], 'c': [3]}
1 [b, b, a] [1, 2, 3] {'b': [1, 2], 'a': [3]}
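As a self-contained sketch of the same grouping idea, the snippet below runs on plain Python lists so it can be tried without a DataFrame. The helper name `group_pairs` and the `rows` data are assumptions made for illustration, mirroring the example table in the question.

```python
from collections import defaultdict

def group_pairs(names, values):
    """Group values by name, keeping every value for repeated names."""
    grouped = defaultdict(list)
    for name, value in zip(names, values):
        grouped[name].append(value)
    # Convert to a plain dict so print() shows it without the defaultdict wrapper.
    return dict(grouped)

# Hypothetical rows mirroring the question's example table.
rows = [
    (["a", "b", "c", "a"], [1, 2, 3, 4]),
    (["b", "b", "a"], [1, 2, 3]),
]
new_column = [group_pairs(names, values) for names, values in rows]
print(new_column)

# With a DataFrame, the same helper could be driven by the two columns, e.g.
# df["New Dict Column"] = [group_pairs(n, v) for n, v in zip(df["Name"], df["Value"])]
```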
Performance is good, roughly 10 times faster than the apply-based dict comprehension in the test below:
#20k rows for test
df = pd.concat([df] * 10000, ignore_index=True)
In [211]: %timeit df.apply(lambda data: {k: [y for x, y in zip(data[0], data[1]) if x == k] for k in data[0]}, axis=1)
532 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [212]: %timeit [ f(x) for x in df[['column1','column2']].to_numpy()]
53.8 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
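A rough way to compare the two grouping strategies outside IPython is the standard `timeit` module. This sketch leaves pandas out entirely, so the numbers will not match the timings above (part of the measured gap there likely comes from the per-row overhead of apply). The function names and test data are assumptions for illustration.

```python
import timeit
from collections import defaultdict

def single_pass(names, values):
    # One pass over the row, appending each value to its name's list.
    grouped = defaultdict(list)
    for name, value in zip(names, values):
        grouped[name].append(value)
    return dict(grouped)

def rescan_per_key(names, values):
    # Re-scans the whole row once for every name in it.
    return {k: [v for n, v in zip(names, values) if n == k] for k in names}

# Hypothetical test data: the two example rows repeated 1000 times.
rows = [(["a", "b", "c", "a"], [1, 2, 3, 4]), (["b", "b", "a"], [1, 2, 3])] * 1000

# Both strategies should build identical dictionaries.
same = all(single_pass(n, v) == rescan_per_key(n, v) for n, v in rows)
print(same)

t_single = timeit.timeit(lambda: [single_pass(n, v) for n, v in rows], number=5)
t_rescan = timeit.timeit(lambda: [rescan_per_key(n, v) for n, v in rows], number=5)
print(f"single pass: {t_single:.4f}s, rescan per key: {t_rescan:.4f}s")
```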
Upvotes: 1
Reputation: 71570
Try something like this with apply:
df['New Dict Column'] = df.apply(lambda data: {k: [y for x, y in zip(data['Name'], data['Value']) if x == k] for k in data['Name']}, axis=1)
print(df)
Output:
Name Value New Dict Column
0 [a, b, c, a] [1, 2, 3, 4] {'a': [1, 4], 'b': [2], 'c': [3]}
1 [b, b, a] [1, 2, 3] {'b': [1, 2], 'a': [3]}
Upvotes: 3
Reputation: 120
You can use apply across several columns with the following pattern:
import pandas as pd
df = pd.DataFrame({'Name': [['a', 'b', 'c', 'a'], ['b', 'b', 'a']],
                   'Value': [['a1', 'b1', 'c1', 'a2'], ['b1', 'b2', 'a1']]})
print(df)

def get_dict(row):
    my_dict = {}
    for x in row['Name']:
        my_dict[x] = row['Value']
    return my_dict
df['my_dict'] = df.apply(get_dict, axis=1)
print(df)
PS: Note that I have not correctly defined how to extract the right elements from Value and map them to the matching element of Name. You will need to implement that part of the code.
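One possible way to fill in that missing part is to pair each name with its value by position and collect the values in a list. This is only a sketch: a plain dict stands in for a DataFrame row (an assumption for the demo), since the function only needs `row['Name']` and `row['Value']` lookups.

```python
def get_dict(row):
    """Map each name in row['Name'] to the list of its values in row['Value']."""
    my_dict = {}
    for name, value in zip(row["Name"], row["Value"]):
        # setdefault creates the list the first time a name is seen, then appends.
        my_dict.setdefault(name, []).append(value)
    return my_dict

# A plain dict used in place of a DataFrame row, for illustration only.
row = {"Name": ["a", "b", "c", "a"], "Value": ["a1", "b1", "c1", "a2"]}
result = get_dict(row)
print(result)  # {'a': ['a1', 'a2'], 'b': ['b1'], 'c': ['c1']}
```

With the DataFrame from the answer, the function would be passed to apply in the same way as before: df.apply(get_dict, axis=1).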
Upvotes: 0