ShanZhengYang

Reputation: 17631

Transforming pandas DataFrame into dictionary via function taking column inputs

I have the following pandas DataFrame:

dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}

df1 = pd.DataFrame(dict1)
print(df1)

        file  amount     front      back
0  filename2       3  21889611  21973805
1  filename2       4  36357723  36403870
2  filename3       5    196312    277500
3  filename4       1        11        19
4  filename4       2        42       120
5  filename3       1      1992      3210

My task is to take N random draws from the integers between front and back (inclusive), where N is the value in amount, and then collect the results into a dictionary.

Doing this on a row-by-row basis is easy for me to understand:

e.g. row 1

import numpy as np
random_draws = np.random.choice(np.arange(21889611, 21973805+1), 3)

e.g. row 2

random_draws = np.random.choice(np.arange(36357723, 36403870+1), 4)
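For wide ranges, building the full np.arange just to sample from it can be avoided. Here is a minimal sketch, assuming sampling with replacement is acceptable (which is also the default of np.random.choice); the helper name draw_inclusive and the seed are only illustrative:

```python
import numpy as np

# Seeded only so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)

def draw_inclusive(front, back, amount):
    """Draw `amount` integers uniformly from [front, back], with replacement."""
    # endpoint=True makes `back` inclusive, like np.arange(front, back + 1).
    return rng.integers(front, back, size=amount, endpoint=True).tolist()

row1_draws = draw_inclusive(21889611, 21973805, 3)  # e.g. row 1
row2_draws = draw_inclusive(36357723, 36403870, 4)  # e.g. row 2
```

Each call returns a plain Python list whose length equals amount, so the lists can later be concatenated per file.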

Normally with pandas, users could define this as a function and use something like

def func(front, back, amount):
    return np.random.choice(np.arange(front, back+1), amount)

df["new_column"].apply(func)

but the result of my function is an array of varying size.

My second problem is that I would like the output to be a dictionary, of the format

{file: [random_draw_results], file: [random_draw_results], file: [random_draw_results], ...}

For the above example df1, the function should output this dictionary (given the draws):

final_dict = {"filename2": [21927457, 21966814, 21898538, 36392840, 36375560, 36384078, 36366833], 
    "filename3": [212143, 239725, 240959, 197359, 276948, 3199],
    "filename4": [100,  83, 15]}

Upvotes: 1

Views: 46

Answers (1)

Anton vBR

Reputation: 18916

We can pass axis=1 to apply so that the function operates on rows instead of columns. Inside the function we pick the columns we need from the row and return the draws as a list.

To combine the lists per file, we can then either use a groupby or a defaultdict, both shown below:

import numpy as np
import pandas as pd

dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}

def func(x):
    # x is one row; draw x.amount values from front..back (inclusive) as a list
    return np.random.choice(np.arange(x.front, x.back+1), x.amount).tolist()

df1 = pd.DataFrame(dict1)
# axis=1 passes each row to func
df1["new_column"] = df1.apply(func, axis=1)

# Concatenate the per-row lists within each file, then convert to a dict
df1.groupby('file')['new_column'].apply(sum).to_dict()

Returns:

{'filename2': [21891765,
  21904680,
  21914414,
  36398355,
  36358161,
  36387670,
  36369443],
 'filename3': [240766, 217580, 217581, 274396, 241413, 2488],
 'filename4': [18, 96, 107]}
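The groupby step above relies on pandas adding the lists together when given the built-in sum, which as far as I know may behave differently depending on the pandas version. A minimal sketch of a more explicit variant that flattens the lists itself; the demo frame holds made-up fixed values in place of the random draws, and the names demo, flatten and merged are only illustrative:

```python
from itertools import chain

import pandas as pd

# Hypothetical fixed values standing in for the random "new_column" lists.
demo = pd.DataFrame({
    "file": ["filename2", "filename2", "filename3",
             "filename4", "filename4", "filename3"],
    "new_column": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12],
                   [13], [14, 15], [16]],
})

def flatten(lists):
    """Concatenate an iterable of lists into one flat list."""
    return list(chain.from_iterable(lists))

# One flat list per file, in the original row order within each group.
merged = demo.groupby("file")["new_column"].apply(flatten).to_dict()
print(merged)
```

Flattening with chain.from_iterable also avoids the repeated list copies that adding lists one after another would make.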

A second alternative is to use a defaultdict (in some small timings I ran, it looks about as fast):

from collections import defaultdict

d = defaultdict(list)

for k,v in df1.set_index('file')['new_column'].items():
    d[k].extend(v)
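The loop above leaves the result in a defaultdict; calling dict(d) gives a plain dictionary. The same idea can also be written without the intermediate new_column, as in this self-contained sketch. It assumes draws with replacement via NumPy's Generator.integers, and the names df, rng, draws_by_file and result are only illustrative:

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "file": ["filename2", "filename2", "filename3",
             "filename4", "filename4", "filename3"],
    "amount": [3, 4, 5, 1, 2, 1],
    "front": [21889611, 36357723, 196312, 11, 42, 1992],
    "back": [21973805, 36403870, 277500, 19, 120, 3210],
})

rng = np.random.default_rng()
draws_by_file = defaultdict(list)

# Walk the rows once and append each row's draws to its file's list.
for row in df.itertuples(index=False):
    # endpoint=True keeps `back` inclusive.
    draws = rng.integers(row.front, row.back, size=row.amount, endpoint=True)
    draws_by_file[row.file].extend(draws.tolist())

result = dict(draws_by_file)  # plain dict: {file: [draws, ...]}
```

Each file ends up with as many values as the sum of its amount entries, for example 3 + 4 = 7 for filename2.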

Upvotes: 1
