Reputation: 71
I need some help/suggestions/guidance on how I can optimize my code. The code works, but with huge data it has been running for almost a day. My data has ~2 million rows; with sample data (a few thousand rows) it works. My sample data format is shown below:
index A B
0 0.163 0.181
1 0.895 0.093
2 0.947 0.545
3 0.435 0.307
4 0.021 0.152
5 0.486 0.977
6 0.291 0.244
7 0.128 0.946
8 0.366 0.521
9 0.385 0.137
10 0.950 0.164
11 0.073 0.541
12 0.917 0.711
13 0.504 0.754
14 0.623 0.235
15 0.845 0.150
16 0.847 0.336
17 0.009 0.940
18 0.328 0.302
What I want to do: Given the above data set, I want to bucket/bin each row into different buckets/bins based on the values of A and B. Each index can only lie in one bin. To do this I have discretized A and B from 0 to 1 (step size of 0.1). My bins for A look like this:
listA = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
and similarly for B:
listB = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
So in total I have 10 * 10 = 100 bins: bin 1 = (A, B) = (0, 0), bin 2 = (0, 0.1), bin 3 = (0, 0.2), ..., bin 10 = (0, 1), bin 11 = (0.1, 0), ..., bin 20 = (0.1, 1), ..., bin 100 = (1, 1). Then, for each index, I check which bin it lies in by running the for loop shown below:
for index in df.index:
    sumlist = []
    for A in listA:
        for B in listB:
            # Keep only the rows whose A and B fall inside the current 0.1-wide bin
            filt_data = df[(df['A'] > A) & (df['A'] < A + 0.1) &
                           (df['B'] > B) & (df['B'] < B + 0.1)]
            data_len = len(filt_data)
            sumlist.append(data_len)
    df_sumlist = pd.DataFrame([sumlist])
    df_output = pd.concat([df_output, df_sumlist], axis=0)
I tried using the pandas cut function for binning, but it appears to work on only one column at a time.
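For illustration, what I tried looks roughly like this (the A_bin column name is just for the example):
# Bins column A alone into its 10 intervals; B is not taken into account
df['A_bin'] = pd.cut(df['A'], bins=listA)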
Expected output
index A B bin1 bin2 bin3 bin4 bin5 ... bin23 ... bin100
0 0.163 0.181 0 0 0 0 0 1 0
1 0.895 0.093
2 0.947 0.545
3 0.435 0.307
4 0.021 0.152
5 0.486 0.977
6 0.291 0.244
7 0.128 0.946
8 0.366 0.521
9 0.385 0.137
10 0.950 0.164
11 0.073 0.541
12 0.917 0.711
13 0.504 0.754
14 0.623 0.235
15 0.845 0.150
16 0.847 0.336
17 0.009 0.940
18 0.328 0.302
I do care about the other bins even if they are zero. For example, index 0 might lie in bin 23, so for index 0 I will have 1 in bin 23 and 0 in all other 99 bins. Similarly, index 1 might lie in bin 91, so it is expected to have 1 in bin 91 and 0 in all other bins.
Thanks for taking the time to read and help me with this; I appreciate it. Please let me know if I am missing anything or need to clarify things.
Upvotes: 0
Views: 90
Reputation: 1352
You were on the right track! pd.cut is the way to go. I'm using the Series categories to create your final bins:
import pandas as pd
import numpy as np
# Generate sample df
df = pd.DataFrame({'A': np.random.uniform(size=20), 'B': np.random.uniform(size=20)})
# Create bins for each column
df["bin_A"] = pd.cut(df["A"], bins=np.linspace(0, 1, 11))
df["bin_B"] = pd.cut(df["B"], bins=np.linspace(0, 1, 11))
# Create a combined bin using category codes for each binned column
df["combined_bin"] = df["bin_A"].cat.codes * 10 + df["bin_B"].cat.codes
df["combined_bin"] = pd.Categorical(df["combined_bin"], categories=range(100))
# Loop over categories to create new columns
for i in df["combined_bin"].cat.categories:
df[f"bin_{i}"] = (df["combined_bin"] == i).astype(int)
EDIT – Generalized solution:
The important part here is defining all possible combinations of bins in both columns, using itertools.product:
import pandas as pd
import numpy as np
import itertools
df = pd.DataFrame({'A': np.random.uniform(size=20), 'B': np.random.uniform(size=20)})
# Change number of bins here or update the `bins` parameter
N_BINS_A = 10
N_BINS_B = 10
df["bin_A"] = pd.cut(df["A"], bins=np.linspace(0, 1, N_BINS_A + 1))
df["bin_B"] = pd.cut(df["B"], bins=np.linspace(0, 1, N_BINS_B + 1))
# Specify all possible bin combinations to use for columns
bin_A_bin_B_combinations = itertools.product(
    df['bin_A'].cat.categories,
    df['bin_B'].cat.categories,
)
# Loop over possible combinations and mark matches
for i, (bin_A, bin_B) in enumerate(bin_A_bin_B_combinations):
    df[f"bin_{i}"] = (
        (df["bin_A"] == bin_A) & (df["bin_B"] == bin_B)
    ).astype(int)
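As a quick sanity check (a sketch; note that pd.cut leaves values falling exactly on 0 unbinned by default, so such rows would show all zeros):
# Every row should fall into exactly one of the combination bins
one_hot = df.filter(regex=r"^bin_\d+$")
print((one_hot.sum(axis=1) == 1).all())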
Upvotes: 1
Reputation: 149175
You could probably use cut on each column and then combine the results to find the category of the row:
acat = pd.cut(df['A'], [.1*i for i in range(11)],
              labels=range(10), include_lowest=True)
bcat = pd.cut(df['B'], [.1*i for i in range(11)],
              labels=range(10), include_lowest=True)
cat = 1 + bcat.cat.codes + acat.cat.codes * 10
With your sample data, I get
0 12
1 81
2 96
3 44
4 2
5 50
6 23
7 20
8 36
9 32
10 92
11 6
12 98
13 58
14 63
15 82
16 84
17 10
18 34
dtype: int8
get_dummies and reindex will give the wide columns:
w = pd.get_dummies(cat).reindex(columns=list(range(1,101))).fillna(0).astype('int8')
We only have to concat it to the original dataframe:
pd.concat([df, w], axis=1)
to get as expected:
index A B 1 2 3 4 5 6 ... 92 93 94 95 96 97 98 99 100
0 0 0.163 0.181 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
1 1 0.895 0.093 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
2 2 0.947 0.545 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0
3 3 0.435 0.307 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
4 4 0.021 0.152 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
5 5 0.486 0.977 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
6 6 0.291 0.244 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
7 7 0.128 0.946 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
8 8 0.366 0.521 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
9 9 0.385 0.137 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
10 10 0.950 0.164 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0
11 11 0.073 0.541 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0
12 12 0.917 0.711 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0
13 13 0.504 0.754 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
14 14 0.623 0.235 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
15 15 0.845 0.150 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
16 16 0.847 0.336 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
17 17 0.009 0.940 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
18 18 0.328 0.302 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
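Note that with 2 million rows the int8 dtype matters: 100 int8 columns take roughly 2e6 * 100 bytes ≈ 200 MB, against about 1.6 GB with the default int64. If even that is too large, a sparse variant is a possible sketch (assuming most entries are 0):
# Same wide frame, stored sparsely; pd.SparseDtype keeps only the non-zero cells
w = (pd.get_dummies(cat)
       .reindex(columns=range(1, 101))
       .fillna(0)
       .astype(pd.SparseDtype('int8', 0)))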
Upvotes: 0