Reputation:
I have a huge csv file (~2GB) that I have imported using Dask. Now I want to normalize this dataframe. The dataframe contains about 70k columns. I have written this python function to calculate this:
def normalize(df):
result = df.copy()
for col in tqdm(df.columns):
if col!=str('name') #basically not to normalize columns with name of "name"
max_value = df[col].max()
min_value = df[col].min()
result[col] = (df[col] - min_value) / (max_value - min_value)
return result
It works okay but takes a lot of time. I put this on execution and its showing it will take appoximately 88 hours to complete. I tried switching to sklearn's minmaxscaler() but it doesn't show any progress of normalization and I am afraid that it will also take quite a lot of time. Is there any other way to normalize all the columns (and ignore a few like I did in that if condition).
Upvotes: 0
Views: 983
Reputation: 36370
I am afraid that it will also take quite a lot of time
Then considering that you just need numerical operations I suggest using numpy
for actual number crunching and pandas
only for extraction of columns to process, simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name':['A','B','C'],'x1':[1,2,3],'x2':[4,8,6],'x3':[10,15,30]})
num_arr = df[['x1','x2','x3']].to_numpy()
mins = np.min(num_arr,0)
maxs = np.max(num_arr,0)
result_arr = (num_arr - mins) / (maxs - mins)
result_df = pd.concat([df[['name']],pd.DataFrame(result_arr,columns=['x1','x2','x3'])],axis=1)
print(result_df)
output
name x1 x2 x3
0 A 0.0 0.0 0.00
1 B 0.5 1.0 0.25
2 C 1.0 0.5 1.00
Disclaimer: this solutions assumes that df
has indices like 0,1,2,...
If you would need further speed increase consider using parallelization, which might be used in this case as values in each columns are computed independently from other columns.
Upvotes: 0
Reputation: 956
You don't need to loop through this. When the other columns than name
are numerical values then you can just do something along the following:
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
Here is a minimal code sample:
import pandas as pd
df = pd.DataFrame({"name": ["a"]*4, "a": [2,3,4,6], "b": [9,5,2,34]})
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
print(df)
Upvotes: 2