Elham
Elham

Reputation: 867

Pandas data frame

I have a question regarding my following code, I have a data set and a list , I want to compare each data value of my data set with two conditions, if the condition is true then keep the previous value of the data frame otherwise make it as None, My code works perfectly for small data set however it will takes too much time and without any values for my big data set. Is there better solution?

new_data=data
    for col in df.columns:
        for i in range(len(df)):
            if (df.iloc[i][col] >list_min[i] ) & (df.iloc[i][col]<list_max[i]):
                new_data.set_value(i,col,df.iloc[i][col])
            else:
                new_data.set_value(i,col,None)

thanks for comments or another solution.

This is my code that is not work :

data = pd.read_csv('./dataset/w.csv')
i=0
data = data.applymap(np.log)
data = data.drop('time', axis=1)
q75_list = []
q25_list = []
iqr_list = []
min_list = []
max_list = []
new_data=data
for col in data.columns.values:
    q75_list.append(np.nanpercentile(data[col], 75))
    q25_list.append(np.nanpercentile(data[col], 25))

    iqr_list = np.array(q75_list) - np.array(q25_list)
    min_list = np.array(q25_list) - (np.array(iqr_list * 1.5))
    max_list = np.array(q75_list) + (np.array(iqr_list * 1.5))

print("Max :\n",max_list,"\n Min :\n",min_list)

for col in data.columns:
    for (i, j) in [(i, j) for i in range(len(data)) for j in range(len(min_list))]:

        if (data.iloc[i][col] >min_list[j] ) & (data.iloc[i][col]<max_list[j]):
            new_data.set_value(i,col,data.iloc[i][col])
        else:
            new_data.set_value(i,col,None)

new_data.to_csv('./dataset/result.csv',index=False)

Upvotes: 0

Views: 289

Answers (2)

Rhosu
Rhosu

Reputation: 81

If I am correctly understanding what you are doing, there are a couple places you could try to vectorize things. See if this speeds things up:

q75s = data.quantile(.75)
q25s = data.quantile(.25)
mins = 2.5*q25s - 1.5*q75s
maxs = 2.5*q75s - 1.5*q25s

newdata = data.copy()
newdata[(data < mins) | (data > maxs)] = None

Upvotes: 1

Parfait
Parfait

Reputation: 107767

Consider the if/then/else idiom using pandas.DataFrame.loc. Below assumes list_min and list_max are lists of equal length to number of rows in data.

for col in data.columns:
    new_data.loc[(data[col] > pd.Series(list_min)) & 
                 (data[col] < pd.Series(list_max)), col] = data[col] 

    new_data.loc[(data[col] < pd.Series(list_min)) |
                 (data[col] > pd.Series(list_max)), col] = None

To demonstrate with example random data of 10 cols and 50 rows (seeded to reproduce):

Data

import pandas as pd
import numpy as np

pd.set_option('display.width', 1000)

np.random.seed(107)
data = pd.DataFrame([[np.random.randint(50) for _ in range(50)] for _ in range(10)]).T
print(data.head())
#     0   1   2   3   4   5   6   7   8   9
# 0  48  17  37  22   1   0   6  14  33  10
# 1  25  38  28   4  36  22   4  24  28  49
# 2   6   5  22  35  14  14  40  41  38  26
# 3  14  43   5  31  38  45  40   5  32   1
# 4  11  30  35  32  20  37  26  39  34   5

list_min = [np.random.randint(50) for _ in range(50)]
print(list_min[:10])
# [37, 17, 33, 24, 0, 46, 11, 4, 25, 41]
list_max = [np.random.randint(50) for _ in range(50)]
print(list_max[:10])
# [45, 37, 49, 38, 31, 9, 20, 39, 7, 36]

Operation

new_data = data.loc[:,]

for col in data.columns:
        new_data.loc[(data[col] > pd.Series(list_min)) & 
                     (data[col] < pd.Series(list_max)), col] = data[col] 

        new_data.loc[(data[col] < pd.Series(list_min)) |
                     (data[col] > pd.Series(list_max)), col] = None

print(new_data.head())
#       0     1     2     3     4     5     6     7     8    9
# 0   NaN   NaN  37.0   NaN   NaN   NaN   NaN   NaN   NaN  NaN
# 1  25.0   NaN  28.0   NaN  36.0  22.0   NaN  24.0  28.0  NaN
# 2   NaN   NaN   NaN  35.0   NaN   NaN  40.0  41.0  38.0  NaN
# 3   NaN   NaN   NaN  31.0  38.0   NaN   NaN   NaN  32.0  NaN
# 4  11.0  30.0   NaN   NaN  20.0   NaN  26.0   NaN   NaN  5.0

Upvotes: 1

Related Questions