Reputation: 867
I have a question regarding the following code. I have a data set and a list, and I want to compare each value of the data set against two conditions: if the condition is true, keep the existing value in the data frame, otherwise set it to None. My code works fine for a small data set, but for my big data set it takes far too long and never produces any values. Is there a better solution?
new_data = data
for col in df.columns:
    for i in range(len(df)):
        if (df.iloc[i][col] > list_min[i]) & (df.iloc[i][col] < list_max[i]):
            new_data.set_value(i, col, df.iloc[i][col])
        else:
            new_data.set_value(i, col, None)
Thanks for any comments or an alternative solution.
This is my code that does not work:
data = pd.read_csv('./dataset/w.csv')
i = 0
data = data.applymap(np.log)
data = data.drop('time', axis=1)
q75_list = []
q25_list = []
iqr_list = []
min_list = []
max_list = []
new_data = data
for col in data.columns.values:
    q75_list.append(np.nanpercentile(data[col], 75))
    q25_list.append(np.nanpercentile(data[col], 25))
iqr_list = np.array(q75_list) - np.array(q25_list)
min_list = np.array(q25_list) - (iqr_list * 1.5)
max_list = np.array(q75_list) + (iqr_list * 1.5)
print("Max :\n", max_list, "\n Min :\n", min_list)
for col in data.columns:
    for (i, j) in [(i, j) for i in range(len(data)) for j in range(len(min_list))]:
        if (data.iloc[i][col] > min_list[j]) & (data.iloc[i][col] < max_list[j]):
            new_data.set_value(i, col, data.iloc[i][col])
        else:
            new_data.set_value(i, col, None)
new_data.to_csv('./dataset/result.csv', index=False)
Upvotes: 0
Views: 289
Reputation: 81
If I am understanding correctly what you are doing, there are a couple of places where you could vectorize things. See if this speeds things up:
q75s = data.quantile(.75)
q25s = data.quantile(.25)
mins = 2.5*q25s - 1.5*q75s
maxs = 2.5*q75s - 1.5*q25s
newdata = data.copy()
newdata[(data < mins) | (data > maxs)] = None
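As a possible follow-up (a sketch under the same assumptions, not part of the original answer): DataFrame.mask collapses the copy-and-blank step into a single expression, since the quantile Series align on the column labels and broadcast column-wise:

# Sketch: NaN out everything outside the IQR fences in one shot.
# mask() returns a new frame with NaN wherever the condition is True,
# mirroring the None assignment above.
newdata = data.mask((data < mins) | (data > maxs))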
Upvotes: 1
Reputation: 107767
Consider the if/then/else idiom using pandas.DataFrame.loc. The code below assumes list_min and list_max are lists equal in length to the number of rows in data.
for col in data.columns:
    new_data.loc[(data[col] > pd.Series(list_min)) &
                 (data[col] < pd.Series(list_max)), col] = data[col]
    new_data.loc[(data[col] < pd.Series(list_min)) |
                 (data[col] > pd.Series(list_max)), col] = None
To demonstrate with example random data of 10 columns and 50 rows (seeded for reproducibility):
Data
import pandas as pd
import numpy as np
pd.set_option('display.width', 1000)
np.random.seed(107)
data = pd.DataFrame([[np.random.randint(50) for _ in range(50)] for _ in range(10)]).T
print(data.head())
# 0 1 2 3 4 5 6 7 8 9
# 0 48 17 37 22 1 0 6 14 33 10
# 1 25 38 28 4 36 22 4 24 28 49
# 2 6 5 22 35 14 14 40 41 38 26
# 3 14 43 5 31 38 45 40 5 32 1
# 4 11 30 35 32 20 37 26 39 34 5
list_min = [np.random.randint(50) for _ in range(50)]
print(list_min[:10])
# [37, 17, 33, 24, 0, 46, 11, 4, 25, 41]
list_max = [np.random.randint(50) for _ in range(50)]
print(list_max[:10])
# [45, 37, 49, 38, 31, 9, 20, 39, 7, 36]
Operation
new_data = data.loc[:,]
for col in data.columns:
    new_data.loc[(data[col] > pd.Series(list_min)) &
                 (data[col] < pd.Series(list_max)), col] = data[col]
    new_data.loc[(data[col] < pd.Series(list_min)) |
                 (data[col] > pd.Series(list_max)), col] = None
print(new_data.head())
# 0 1 2 3 4 5 6 7 8 9
# 0 NaN NaN 37.0 NaN NaN NaN NaN NaN NaN NaN
# 1 25.0 NaN 28.0 NaN 36.0 22.0 NaN 24.0 28.0 NaN
# 2 NaN NaN NaN 35.0 NaN NaN 40.0 41.0 38.0 NaN
# 3 NaN NaN NaN 31.0 38.0 NaN NaN NaN 32.0 NaN
# 4 11.0 30.0 NaN NaN 20.0 NaN 26.0 NaN NaN 5.0
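If the per-column loop itself becomes a bottleneck, the same row-wise comparison can be written without a loop by aligning the bound Series on the index (a sketch assuming the same list_min and list_max as above, not part of the original answer):

# Sketch of a loop-free equivalent: lt/gt with axis=0 align the bounds on the
# row index, and mask() replaces values with NaN wherever the condition is True,
# mirroring the None assignment in the loop.
bounds_min = pd.Series(list_min, index=data.index)
bounds_max = pd.Series(list_max, index=data.index)
new_data = data.mask(data.lt(bounds_min, axis=0) | data.gt(bounds_max, axis=0))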
Upvotes: 1