Reputation: 17631
I have the following pandas Dataframe
:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
df1 = pd.DataFrame(dict1)
print(df1)
file amount front back
0 filename2 3 21889611 21973805
1 filename2 4 36357723 36403870
2 filename3 5 196312 277500
3 filename4 1 11 19
4 filename4 2 42 120
5 filename3 1 1992 3210
My task is to take N
random draws between front
and back
, whereby N
is equal to the value in amount
. Parse this into a dictionary.
To do this on an row-by-row basis is easy for me to understand:
e.g. row 1
import numpy as np
random_draws = np.random.choice(np.arange(21889611, 21973805+1), 3)
e.g. row 2
random_draws = np.random.choice(np.arange(36357723, 36403870+1), 4)
Normally with pandas, users could define this as a function and use something like
def func(front, back, amount):
return np.random.choice(np.arange(front, back+1), amount)
df["new_column"].apply(func)
but the result of my function is an array of varying size.
My second problem is that I would like the output to be a dictionary, of the format
{file: [random_draw_results], file: [random_draw_results], file: [random_draw_results], ...}
For the above example df1
, the function should output this dictionary (given the draws):
final_dict = {"filename2": [21927457, 21966814, 21898538, 36392840, 36375560, 36384078, 36366833],
"filename3": 212143, 239725, 240959, 197359, 276948, 3199],
"filename4": [100, 83, 15]}
Upvotes: 1
Views: 46
Reputation: 18916
We can pass axis=1 to operate over rows when using apply. We then need to tell what columns to use and we return a list.
We then either perform some form of groupby or we could use defaultdict as shown below:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}
import numpy as np
import pandas as pd
def func(x):
return np.random.choice(np.arange(x.front, x.back+1), x.amount).tolist()
df1 = pd.DataFrame(dict1)
df1["new_column"] = df1.apply(func, axis=1)
df1.groupby('file')['new_column'].apply(sum).to_dict()
Returns:
{'filename2': [21891765,
21904680,
21914414,
36398355,
36358161,
36387670,
36369443],
'filename3': [240766, 217580, 217581, 274396, 241413, 2488],
'filename4': [18, 96, 107]}
Alt2 would be to use (and by some small timings I ran it looks like it runs as fast):
from collections import defaultdict
d = defaultdict(list)
for k,v in df1.set_index('file')['new_column'].items():
d[k].extend(v)
Upvotes: 1