Reputation: 1272
My question is how to simplify the following code. In my real data I would have to add 540 columns, but I guess there is a better way, especially for generating the columns. Maybe even separate dataframes?
Below you'll see the test-df I have with one column needed. 'bin_X_0' through 'bin_X_9' need to be generated, but then several, i.e. bin_Y_0, bin_Z_0 etc. through 9.
N = 10000
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h', 'd', 'a']
S = ['AR1','PO1','RU1']
np.random.seed(0)
df = pd.DataFrame(
{'Y':np.random.uniform(1,10,N),
'X':np.random.uniform(1,10,N),
'Z':np.random.uniform(1,10,N),
'J':np.random.choice(J,N),
'S':np.random.choice(S,N),
'R':np.random.choice(L,N)
})
df['bins_X'] = df.groupby('S').X.apply(pd.qcut, q=10, labels=np.arange(10))
df['bin_X_0'] = np.where((df['bins_X'] ==0) & (df['R'] =='a'), (df['X']*2)-2,
np.where((df['bins_X'] ==0) & (df['R'] !='a'), -2, 0))
df.head()
Upvotes: 0
Views: 101
Reputation: 2859
You can achieve this with a double for loop, by using current iterables and format
function for the new column generation. The following code should do it:
for a in ["X", "Y", "Z"]:
for num in list(range(0, 10)):
df['bins_{}'.format(a)] = df.groupby('S')["{}".format(a)].apply(pd.qcut, q=10, labels=np.arange(10))
df['bin_{}_{}'.format(a, num)] = np.where((df['bins_{}'.format(a)] == num) & (df['R'] == 'a'),
(df['{}'.format(a)] * 2) - 2,
np.where((df['bins_{}'.format(a)] == num) & (df['R'] != 'a'), -2, 0))
I hope it is clear, how current numbers and letters are dynamically picking up the appropriate data and generating two new columns inside every nested loop. The above code produces this dataframe, which I believe is what you are looking for:
J R S X Y Z bins_X bin_X_0 bin_X_1 \
0 2014 d PO1 7.734412 5.939322 4.529557 7 0.000000 0.0
1 2014 h AR1 2.621824 7.436704 1.370409 1 0.000000 -2.0
2 2013 h PO1 4.501208 6.424870 9.309705 4 0.000000 0.0
3 2012 a RU1 1.338402 5.903949 4.656115 0 0.676803 0.0
4 2012 d RU1 1.106090 4.812893 9.498540 0 -2.000000 0.0
bin_X_2 ... bin_Z_0 bin_Z_1 bin_Z_2 bin_Z_3 bin_Z_4 bin_Z_5 \
0 0.0 ... 0.0 0.0 0.0 -2.0 0.00000 0.0
1 0.0 ... -2.0 0.0 0.0 0.0 0.00000 0.0
2 0.0 ... 0.0 0.0 0.0 0.0 0.00000 0.0
3 0.0 ... 0.0 0.0 0.0 0.0 7.31223 0.0
4 0.0 ... 0.0 0.0 0.0 0.0 0.00000 0.0
bin_Z_6 bin_Z_7 bin_Z_8 bin_Z_9
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 -2.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 -2.0
I hope you get the idea.
Upvotes: 1
Reputation: 2428
the best way to add new columns if they are related to the data already in the DataFrame is to use pd.concat or pd.merge.
You can read more about it on http://pandas.pydata.org/pandas-docs/stable/merging.html
Upvotes: 0