Reputation: 25366
I have the following code:
df_boundry = df_in.dropna().quantile([0.0, .8])
for row in df_in.iterrows():
    for column in row:
        if row[column] > df_boundry[column][0.8]:
            row[column] = df_boundry[column][0.8]
Basically, for each row (record), we check each column value. If the value is above the 80th percentile, we replace it with the 80th-percentile value. But I got the following error with the above code:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-67-81b2be77cc8a> in <module>()
4 for row in df_in.iterrows():
5 for column in row:
----> 6 if row[column] > df_boundry[column][0.8]:
7 row[column] = df_boundry[column][0.8]
8
/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1995 return self._getitem_multilevel(key)
1996 else:
-> 1997 return self._getitem_column(key)
1998
1999 def _getitem_column(self, key):
/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
2002 # get column
2003 if self.columns.is_unique:
-> 2004 return self._get_item_cache(key)
2005
2006 # duplicate columns & possible reduce dimensionality
/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1348 res = cache.get(item)
1349 if res is None:
-> 1350 values = self._data.get(item)
1351 res = self._box_item_values(item, values)
1352 cache[item] = res
/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
3288
3289 if not isnull(item):
-> 3290 loc = self.items.get_loc(item)
3291 else:
3292 indexer = np.arange(len(self.items))[isnull(self.items)]
/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
1945 return self._engine.get_loc(key)
1946 except KeyError:
-> 1947 return self._engine.get_loc(self._maybe_cast_indexer(key))
1948
1949 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()
KeyError: 0
Here is some sample data for df_in:
column_A | column_B | column_C
--------------------------------
0.5 | 0.5 | NaN
1.2 | NaN | NaN
NaN | 8.1 | 21.1
9.1 | 9.3 | 2.1
4.5 | 90.1 | 1.4
112.3 | 79.2 | 1.1
:
:
and df_boundry:
| column_A | column_B | column_C
----------------------------------------
0.0 | 0.1 | 0.4 | 0.0
0.8 | 110.4 | 80.1 | 20.5
The expected results for the sample data should be
column_A | column_B | column_C
--------------------------------
0.5 | 0.5 | NaN
1.2 | NaN | NaN
NaN | 8.1 | 20.5
9.1 | 9.3 | 2.1
4.5 | 80.1 | 1.4
110.4 | 79.2 | 1.1
:
:
i.e. a cell is replaced with df_boundry[column][0.8] only if its value is greater than df_boundry[column][0.8].
Does anyone know what I missed here? Thanks!
Upvotes: 1
Views: 392
Reputation: 210832
UPDATE2: note the difference between .iloc[-1] and .iloc[[-1]] here:
In [7]: df_boundry
Out[7]:
column_A column_B column_C
0.0 0.1 0.4 0.0
0.8 110.4 80.1 20.5
In [8]: df_boundry.iloc[-1]
Out[8]:
column_A 110.4
column_B 80.1
column_C 20.5
Name: 0.8, dtype: float64
In [9]: df_boundry.iloc[[-1]]
Out[9]:
column_A column_B column_C
0.8 110.4 80.1 20.5
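(A small sketch of my own, reconstructing df_boundry from the values printed above, to make the distinction explicit: .iloc[-1] yields a Series keyed by column name, which is what drives the column-wise comparison, while .iloc[[-1]] keeps a one-row DataFrame that can be repeated with pd.concat.)

import pandas as pd

# hypothetical reconstruction of df_boundry from the output shown above
df_boundry = pd.DataFrame(
    {'column_A': [0.1, 110.4], 'column_B': [0.4, 80.1], 'column_C': [0.0, 20.5]},
    index=[0.0, 0.8],
)

upper_series = df_boundry.iloc[-1]    # Series: one threshold per column
upper_frame = df_boundry.iloc[[-1]]   # one-row DataFrame: keeps the tabular shape for pd.concat
print(type(upper_series).__name__, type(upper_frame).__name__)  # -> Series DataFrame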
UPDATE: still using the same old code with your updated DFs:
In [373]: df_boundry
Out[373]:
column_A column_B column_C
0.0 0.1 0.4 0.0
0.8 110.4 80.1 20.5
In [374]: df
Out[374]:
column_A column_B column_C
0 0.5 0.5 NaN
1 1.2 NaN NaN
2 NaN 8.1 1.1
3 9.1 9.3 2.1
4 4.5 1.1 1.4
5 112.3 79.2 1.1
In [375]: sav = df.copy()
In [376]: df[df > df_boundry.iloc[-1]] = pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index)
In [377]: df
Out[377]:
column_A column_B column_C
0 0.5 0.5 NaN
1 1.2 NaN NaN
2 NaN 8.1 1.1
3 9.1 9.3 2.1
4 4.5 1.1 1.4
5 110.4 79.2 1.1
OLD answer:
you can do it this (vectorized) way:
In [350]: df
Out[350]:
column_A column_B column_C
0 0.5 0.5 NaN
1 1.2 NaN NaN
2 NaN 8.1 1.1
3 9.1 9.3 2.1
4 4.5 1.1 1.4
In [351]: df_boundry = df.dropna().quantile([0.0, .8])
In [352]: df_boundry
Out[352]:
column_A column_B column_C
0.0 4.50 1.10 1.40
0.8 8.18 7.66 1.96
In [353]: df[df > df_boundry.iloc[-1]] = pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index)
In [354]: df
Out[354]:
column_A column_B column_C
0 0.50 0.50 NaN
1 1.20 NaN NaN
2 NaN 7.66 1.10
3 8.18 7.66 1.96
4 4.50 1.10 1.40
Explanation:
In [365]: df > df_boundry.iloc[-1]
Out[365]:
column_A column_B column_C
0 False False False
1 False False False
2 False True False
3 True True True
4 False False False
In [356]: df_boundry.iloc[[-1]]
Out[356]:
column_A column_B column_C
0.8 8.18 7.66 1.96
In [357]: pd.concat([df_boundry.iloc[[-1]]] * len(df))
Out[357]:
column_A column_B column_C
0.8 8.18 7.66 1.96
0.8 8.18 7.66 1.96
0.8 8.18 7.66 1.96
0.8 8.18 7.66 1.96
0.8 8.18 7.66 1.96
In [358]: pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index)
Out[358]:
column_A column_B column_C
0 8.18 7.66 1.96
1 8.18 7.66 1.96
2 8.18 7.66 1.96
3 8.18 7.66 1.96
4 8.18 7.66 1.96
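As a side note, the mask/concat/set_index construction can usually be collapsed into a single clip call. This is just a sketch of mine, assuming a pandas version where DataFrame.clip aligns a per-column Series of upper bounds via axis=1 (df_capped is a name I introduce here):

upper = df_boundry.iloc[-1]               # Series: one 0.8-quantile threshold per column
df_capped = df.clip(upper=upper, axis=1)  # values above their column's threshold are replaced by it; NaNs stay NaN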
Upvotes: 2
Reputation: 807
Instead of calculating all of the quantiles upfront, you can use the apply method of the DataFrame and operate on each column separately.
def fill_with_quantile(col, q=0.80):
    # quantile of the column, ignoring NaNs
    q_value = col.dropna().quantile(q)
    # cap values above the quantile at the quantile itself
    col[col > q_value] = q_value
    return col

df_in.apply(lambda col: fill_with_quantile(col, 0.8), axis=0)
If you wanted, you could also change the fill_with_quantile function to clip both the lower and upper extreme values (e.g. the 0.2 and 0.8 quantiles), as sketched below.
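For example, a minimal sketch of such a two-sided version (the name fill_between_quantiles and the 0.2/0.8 defaults are my own, and Series.clip is used here instead of boolean assignment):

def fill_between_quantiles(col, q_low=0.20, q_high=0.80):
    # quantile thresholds computed ignoring NaNs
    low = col.dropna().quantile(q_low)
    high = col.dropna().quantile(q_high)
    # clip values outside [low, high]; NaN entries are left untouched
    return col.clip(lower=low, upper=high)

# apply returns a new DataFrame, so assign the result rather than relying on in-place changes
df_capped = df_in.apply(fill_between_quantiles, axis=0)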
Upvotes: 1