Zanshin
Zanshin

Reputation: 1272

create new columns or dataframe with a function?

My question is how to simplify the following code. In my real data I would have to add 540 columns, but I guess there is a better way, especially for generating the columns. Maybe even separate dataframes?

Below you'll see the test-df I have with one column needed. 'bin_X_0' through 'bin_X_9' need to be generated, but then several, i.e. bin_Y_0, bin_Z_0 etc. through 9.

N = 10000
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h', 'd', 'a'] 
S = ['AR1','PO1','RU1']

np.random.seed(0)

df = pd.DataFrame(
    {'Y':np.random.uniform(1,10,N),
     'X':np.random.uniform(1,10,N),
     'Z':np.random.uniform(1,10,N),
     'J':np.random.choice(J,N),
     'S':np.random.choice(S,N),
     'R':np.random.choice(L,N)
    })

df['bins_X'] = df.groupby('S').X.apply(pd.qcut, q=10, labels=np.arange(10))

df['bin_X_0'] = np.where((df['bins_X'] ==0) & (df['R'] =='a'), (df['X']*2)-2, 
                        np.where((df['bins_X'] ==0) & (df['R'] !='a'), -2, 0))

df.head()

enter image description here

Upvotes: 0

Views: 101

Answers (2)

Marjan Moderc
Marjan Moderc

Reputation: 2859

You can achieve this with a double for loop, by using current iterables and format function for the new column generation. The following code should do it:

for a in ["X", "Y", "Z"]:
    for num in list(range(0, 10)):
        df['bins_{}'.format(a)] = df.groupby('S')["{}".format(a)].apply(pd.qcut, q=10, labels=np.arange(10))

        df['bin_{}_{}'.format(a, num)] = np.where((df['bins_{}'.format(a)] == num) & (df['R'] == 'a'),
                                                  (df['{}'.format(a)] * 2) - 2,
                                                  np.where((df['bins_{}'.format(a)] == num) & (df['R'] != 'a'), -2, 0))

I hope it is clear, how current numbers and letters are dynamically picking up the appropriate data and generating two new columns inside every nested loop. The above code produces this dataframe, which I believe is what you are looking for:

 J  R    S         X         Y         Z bins_X   bin_X_0  bin_X_1  \
0  2014  d  PO1  7.734412  5.939322  4.529557      7  0.000000      0.0   
1  2014  h  AR1  2.621824  7.436704  1.370409      1  0.000000     -2.0   
2  2013  h  PO1  4.501208  6.424870  9.309705      4  0.000000      0.0   
3  2012  a  RU1  1.338402  5.903949  4.656115      0  0.676803      0.0   
4  2012  d  RU1  1.106090  4.812893  9.498540      0 -2.000000      0.0   

   bin_X_2   ...     bin_Z_0  bin_Z_1  bin_Z_2  bin_Z_3  bin_Z_4  bin_Z_5  \
0      0.0   ...         0.0      0.0      0.0     -2.0  0.00000      0.0   
1      0.0   ...        -2.0      0.0      0.0      0.0  0.00000      0.0   
2      0.0   ...         0.0      0.0      0.0      0.0  0.00000      0.0   
3      0.0   ...         0.0      0.0      0.0      0.0  7.31223      0.0   
4      0.0   ...         0.0      0.0      0.0      0.0  0.00000      0.0   

   bin_Z_6 bin_Z_7  bin_Z_8  bin_Z_9  
0      0.0     0.0      0.0      0.0  
1      0.0     0.0      0.0      0.0  
2      0.0     0.0      0.0     -2.0  
3      0.0     0.0      0.0      0.0  
4      0.0     0.0      0.0     -2.0 

I hope you get the idea.

Upvotes: 1

epattaro
epattaro

Reputation: 2428

the best way to add new columns if they are related to the data already in the DataFrame is to use pd.concat or pd.merge.

You can read more about it on http://pandas.pydata.org/pandas-docs/stable/merging.html

Upvotes: 0

Related Questions