Edamame

Reputation: 25366

pandas: error when looping over a given pandas row

I have the following code:

df_boundry = df_in.dropna().quantile([0.0, .8])
for row in df_in.iterrows():
    for column in row:
        if row[column] > df_boundry[column][0.8]:
            row[column] = df_boundry[column][0.8]

Basically, for each row (record), we check each column's value. If the value is above the 80th percentile, we replace it with the 80th-percentile value. But the code above raises the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-67-81b2be77cc8a> in <module>()
      4 for row in df_in.iterrows():
      5     for column in row:
----> 6         if row[column] > df_boundry[column][0.8]:
      7             row[column] = df_boundry[column][0.8]
      8 

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1995             return self._getitem_multilevel(key)
   1996         else:
-> 1997             return self._getitem_column(key)
   1998 
   1999     def _getitem_column(self, key):

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   2002         # get column
   2003         if self.columns.is_unique:
-> 2004             return self._get_item_cache(key)
   2005 
   2006         # duplicate columns & possible reduce dimensionality

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1348         res = cache.get(item)
   1349         if res is None:
-> 1350             values = self._data.get(item)
   1351             res = self._box_item_values(item, values)
   1352             cache[item] = res

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
-> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

/home/edamame/anaconda2/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: 0

Here is some sample data for df_in:

    column_A | column_B | column_C
    --------------------------------
     0.5     |   0.5    |  NaN
     1.2     |   NaN    |  NaN
     NaN     |   8.1    | 21.1
     9.1     |   9.3    |  2.1
     4.5     |  90.1    |  1.4
   112.3     |  79.2    |  1.1
       :
       :

and df_boundry:

        |  column_A  |  column_B  |  column_C
    ------------------------------------------
    0.0 |     0.1    |    0.4     |   0.0
    0.8 |    110.4   |   80.1     |  20.5

The expected result for the sample data should be:

    column_A | column_B | column_C
    --------------------------------
     0.5     |   0.5    |  NaN
     1.2     |   NaN    |  NaN
     NaN     |   8.1    | 20.5
     9.1     |   9.3    |  2.1
     4.5     |  80.1    |  1.4
   110.4     |  79.2    |  1.1
       :
       :

i.e. only if the cell value > df_boundry[column][0.8] do we replace it with df_boundry[column][0.8].
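For reference, the end result I am after can also be sketched with DataFrame.clip (this is just an illustration of the goal, not the loop that fails above):

    # Illustration only: cap each column at its 80th-percentile value.
    # df_boundry.loc[0.8] is the row of 80th-percentile values, indexed by column name.
    capped = df_in.clip(upper=df_boundry.loc[0.8], axis=1)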

Does anyone know what I missed here? Thanks!

Upvotes: 1

Views: 392

Answers (2)

MaxU - stand with Ukraine

Reputation: 210832

UPDATE2: note that df_boundry.iloc[-1] returns the last row as a Series, while df_boundry.iloc[[-1]] returns it as a one-row DataFrame:

In [7]: df_boundry
Out[7]:
     column_A  column_B  column_C
0.0       0.1       0.4       0.0
0.8     110.4      80.1      20.5

In [8]: df_boundry.iloc[-1]
Out[8]:
column_A    110.4
column_B     80.1
column_C     20.5
Name: 0.8, dtype: float64

In [9]: df_boundry.iloc[[-1]]
Out[9]:
     column_A  column_B  column_C
0.8     110.4      80.1      20.5

UPDATE: still using the same code from the old answer, but with your updated DataFrames:

In [373]: df_boundry
Out[373]:
     column_A  column_B  column_C
0.0       0.1       0.4       0.0
0.8     110.4      80.1      20.5

In [374]: df
Out[374]:
   column_A  column_B  column_C
0       0.5       0.5       NaN
1       1.2       NaN       NaN
2       NaN       8.1       1.1
3       9.1       9.3       2.1
4       4.5       1.1       1.4
5     112.3      79.2       1.1

In [375]: sav = df.copy()

In [376]: df[df > df_boundry.iloc[-1]] = pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index)

In [377]: df
Out[377]:
   column_A  column_B  column_C
0       0.5       0.5       NaN
1       1.2       NaN       NaN
2       NaN       8.1       1.1
3       9.1       9.3       2.1
4       4.5       1.1       1.4
5     110.4      79.2       1.1

OLD answer:

You can do it this (vectorized) way:

In [350]: df
Out[350]:
   column_A  column_B  column_C
0       0.5       0.5       NaN
1       1.2       NaN       NaN
2       NaN       8.1       1.1
3       9.1       9.3       2.1
4       4.5       1.1       1.4

In [351]: df_boundry = df.dropna().quantile([0.0, .8])

In [352]: df_boundry
Out[352]:
     column_A  column_B  column_C
0.0      4.50      1.10      1.40
0.8      8.18      7.66      1.96

In [353]: df[df > df_boundry.iloc[-1]] = pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index)

In [354]: df
Out[354]:
   column_A  column_B  column_C
0      0.50      0.50       NaN
1      1.20       NaN       NaN
2       NaN      7.66      1.10
3      8.18      7.66      1.96
4      4.50      1.10      1.40

Explanation:

In [365]: df > df_boundry.iloc[-1]
Out[365]:
  column_A column_B column_C
0    False    False    False
1    False    False    False
2    False     True    False
3     True     True     True
4    False    False    False

In [356]: df_boundry.iloc[[-1]]
Out[356]:
     column_A  column_B  column_C
0.8      8.18      7.66      1.96

In [357]: pd.concat([df_boundry.iloc[[-1]]] * len(df))
Out[357]:
     column_A  column_B  column_C
0.8      8.18      7.66      1.96
0.8      8.18      7.66      1.96
0.8      8.18      7.66      1.96
0.8      8.18      7.66      1.96
0.8      8.18      7.66      1.96

In [358]: pd.concat([df_boundry.iloc[[-1]]] * len(df)).set_index(df.index)
Out[358]:
   column_A  column_B  column_C
0      8.18      7.66      1.96
1      8.18      7.66      1.96
2      8.18      7.66      1.96
3      8.18      7.66      1.96
4      8.18      7.66      1.96

Upvotes: 2

shawnheide

Reputation: 807

Instead of calculating all of the quantiles upfront, you can use the apply method of the DataFrame and operate on each column separately.

def fill_with_quantile(col, q=0.80):
    # cap values in this column at its q-th quantile (computed on non-NaN values)
    q_value = col.dropna().quantile(q)
    col[col > q_value] = q_value
    return col

df_in.apply(lambda col: fill_with_quantile(col, 0.8), axis=0)

If you wanted, you could also change the fill_with_quantile function to clip both the lower and upper extreme values (i.e. at the 0.2 and 0.8 quantiles), as sketched below.
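For example, a two-sided variant could look like this (clip_with_quantiles is just a hypothetical name for such a helper, using Series.clip):

def clip_with_quantiles(col, q_low=0.20, q_high=0.80):
    # Hypothetical two-sided variant: cap values below the q_low quantile
    # and above the q_high quantile (quantiles computed on non-NaN values).
    low = col.dropna().quantile(q_low)
    high = col.dropna().quantile(q_high)
    return col.clip(lower=low, upper=high)

df_in.apply(clip_with_quantiles, axis=0)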

Upvotes: 1
