Pandas how to not apply to whole column

Question

self.df['Regular Price'] = self.df['Regular Price'].apply(
            lambda x: int(round(x)) if isinstance(
                x, (int, float)) else None
        )

The above code assigns None to every value in the Regular Price column whenever it encounters a non-numeric value in the DataFrame. I want to assign None only to the cells that hold a non-numeric value.

thanks

jezrael · Accepted Answer

First, it is impossible to keep NaN values in an integer column, because NaN is a float by design.

Your solution works if the column has mixed types, that is, numbers together with strings:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Regular Price': ['a',1,2.3,'a',7],
    'B': list(range(5))
})
print (df)
   B Regular Price
0  0             a
1  1             1
2  2           2.3
3  3             a
4  4             7

df['Regular Price'] = df['Regular Price'].apply(
            lambda x: int(round(x)) if isinstance(
                x, (int, float)) else None
        )

print (df)
   B  Regular Price
0  0            NaN
1  1            1.0
2  2            2.0
3  3            NaN
4  4            7.0
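One caveat with the per-cell check: `isinstance(x, (int, float))` is also true for NaN (which is a float) and for booleans, and `round(float('nan'))` raises a ValueError. A minimal sketch of a stricter per-cell function follows. The name `to_int_or_none` and the decision to treat NaN, infinities and booleans as non-numeric are assumptions made for illustration, not part of the original answer:

```python
import math
import numbers

import pandas as pd


def to_int_or_none(value):
    """Return a rounded int for finite real numbers, otherwise None (hypothetical helper)."""
    # bool is a subclass of int, so exclude it explicitly (assumption).
    if isinstance(value, bool) or not isinstance(value, numbers.Real):
        return None
    # NaN and infinities are floats but cannot be converted to int.
    if not math.isfinite(value):
        return None
    return int(round(value))


prices = pd.Series(['a', 1, 2.3, float('nan'), 7])
cleaned = prices.apply(to_int_or_none)
print(cleaned)
```

Only the non-numeric cells end up missing; the other cells keep their rounded values.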

But if all the data are strings, you need to_numeric with errors='coerce' to convert the non-numeric values to NaNs:

df = pd.DataFrame({
    'Regular Price': ['a','1','2.3','a','7'],
    'B': list(range(5))
})
print (df)
   B Regular Price
0  0             a
1  1             1
2  2           2.3
3  3             a
4  4             7

df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
print (df)
   B  Regular Price
0  0            NaN
1  1            1.0
2  2            2.0
3  3            NaN
4  4            7.0
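to_numeric with errors='coerce' also accepts an object column that mixes real numbers with numeric strings, so a single call can cover both cases shown above. A small sketch with made-up sample data:

```python
import pandas as pd

# Sample column mixing strings, ints, floats and numeric strings (illustrative data).
mixed = pd.Series(['a', 1, 2.3, 'a', '7'], name='Regular Price')

# Non-numeric cells become NaN; everything else is parsed, then rounded.
rounded = pd.to_numeric(mixed, errors='coerce').round()
print(rounded)
```

The result is a float column, so the NaN cells and the rounded numbers can live side by side.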

EDIT:

I also need to remove floating points and use int only

It is possible by converting NaNs to None and casting the remaining values to int:

df['Regular Price'] = pd.to_numeric(df['Regular Price'],
                                    errors='coerce').round()

df['Regular Price'] = np.where(df['Regular Price'].isnull(), 
                               None,
                               df['Regular Price'].fillna(0).astype(int))

print (df)
   B Regular Price
0  0          None
1  1             1
2  2             2
3  3          None
4  4             7


print (df['Regular Price'].apply(type))
0    <class 'NoneType'>
1         <class 'int'>
2         <class 'int'>
3    <class 'NoneType'>
4         <class 'int'>
Name: Regular Price, dtype: object

However, this is slow, so it is best not to use it. It also causes other problems, because some functions fail on such a column, so it is best to keep floats when working with NaNs:

Testing some functions, such as diff, on a DataFrame with 50k rows:

df = pd.DataFrame({
    'Regular Price': ['a','1','2.3','a','7'],
    'B': list(range(5))
})
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)

df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()

df['Regular Price1'] = np.where(df['Regular Price'].isnull(), 
                               None,
                               df['Regular Price'].fillna(0).astype(int))

In [252]: %timeit df['Regular Price2'] = df['Regular Price1'].diff()

TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

In [274]: %timeit df['Regular Price3'] = df['Regular Price'].diff()
1000 loops, best of 3: 301 µs per loop

In [272]: %timeit df['Regular Price2'] = df['Regular Price1'] * 1000
100 loops, best of 3: 4.48 ms per loop

In [273]: %timeit df['Regular Price3'] = df['Regular Price'] * 1000
1000 loops, best of 3: 469 µs per loop
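Not part of the original answer: newer pandas versions also provide a nullable integer extension dtype, `Int64` (introduced in pandas 0.24), which can hold integers next to missing values without falling back to an object column. A minimal sketch, assuming such a pandas version is installed:

```python
import pandas as pd

df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5)),
})

# Coerce to float with NaN, round, then cast to the nullable integer dtype.
# Note the capital "I": 'Int64' is the nullable extension dtype, not NumPy's 'int64'.
prices = pd.to_numeric(df['Regular Price'], errors='coerce').round().astype('Int64')
df['Regular Price'] = prices

print(df['Regular Price'].dtype)
print(df['Regular Price'] * 1000)  # arithmetic keeps the missing cells missing
```

Rounding before the cast matters here, because casting floats with a fractional part to `Int64` raises an error instead of truncating.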

EDIT:

df = pd.DataFrame({
    'Regular Price': ['a','1','2.3','a','7'],
    'B': list(range(5))
})
print (df)
   B Regular Price
0  0             a
1  1             1
2  2           2.3
3  3             a
4  4             7

df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()
print (df)
   B  Regular Price
0  0            NaN
1  1            1.0
2  2            2.0
3  3            NaN
4  4            7.0

First, it is possible to remove the rows with NaN in the Regular Price column and then convert to int.

df1 = df.dropna(subset=['Regular Price']).copy()
df1['Regular Price']  = df1['Regular Price'].astype(int)
print (df1)
   B  Regular Price
1  1              1
2  2              2
4  4              7

Process what you need, but do not change the index.

# e.g. some processing step
df1['Regular Price'] = df1['Regular Price'] * 100

Last, use combine_first, which restores the dropped rows with NaN in the Regular Price column.

df2 = df1.combine_first(df)
print (df2)
     B  Regular Price
0  0.0            NaN
1  1.0          100.0
2  2.0          200.0
3  3.0            NaN
4  4.0          700.0
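In the output above, column B is printed as floats after combine_first. If only Regular Price needs to be written back, one alternative (a sketch, not the original answer's method) is to assign the processed column to a copy of the original frame and rely on index alignment, which leaves the other columns untouched:

```python
import pandas as pd

df = pd.DataFrame({
    'Regular Price': ['a', '1', '2.3', 'a', '7'],
    'B': list(range(5)),
})
df['Regular Price'] = pd.to_numeric(df['Regular Price'], errors='coerce').round()

# Work on the rows that have a price, keeping their original index labels.
df1 = df.dropna(subset=['Regular Price']).copy()
df1['Regular Price'] = df1['Regular Price'].astype(int) * 100

# Column assignment aligns on the index: rows missing from df1 get NaN.
df2 = df.copy()
df2['Regular Price'] = df1['Regular Price']
print(df2)
```

This only works if the index of df1 has not been changed, which is the same requirement as for combine_first.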
