Igor Rivin
Igor Rivin

Reputation: 4864

replacing dots in pandas dataframe

I have a dataframe as in the image. (the numbers are really objects). Doing df.treasury_rate = pd.to_numeric(df.treasury_rate), predictably bombs. However, doing df.replace('.', np.nan) does not appear to get rid of the dot, so I am flummoxed. Any suggestions?

UPDATE pd.to_numeric takes an errors keyword, which, when set to coerce does the right thing, however, I am still confused as to why the . is not getting replaced.

UPDATE 2 As text:

treasury_rate
1962-02-09 4.05
1962-02-10 4.05
1962-02-11 4.05
1962-02-12 .
1962-02-13 4.03
1962-02-14 4.03
1962-02-15 4.02 
1962-02-16 4.02
1962-02-17 4.02
1962-02-18 4.02

Upvotes: 2

Views: 7564

Answers (4)

Gonçalo Peres
Gonçalo Peres

Reputation: 13582

Given OP's dataframe

import pandas as pd
import datetime as dt

df = pd.DataFrame({'treasury_rate': [4.05, 4.05, 4.05, ".", 4.03, 4.03, 4.02, 4.02, 4.02, 4.0]},
                    index=[dt.datetime(1962, 2, 9), dt.datetime(1962, 2, 10), dt.datetime(1962, 2, 11), dt.datetime(1962, 2, 12), dt.datetime(1962, 2, 13), dt.datetime(1962, 2, 14), dt.datetime(1962, 2, 15), dt.datetime(1962, 2, 16), dt.datetime(1962, 2, 17), dt.datetime(1962, 2, 18)])


[Out]:
           treasury_rate
1962-02-09          4.05
1962-02-10          4.05
1962-02-11          4.05
1962-02-12             .
1962-02-13          4.03
1962-02-14          4.03
1962-02-15          4.02
1962-02-16          4.02
1962-02-17          4.02
1962-02-18           4.0

There are various ways one can go about replacing the dot with NaN.

  1. Using regular expression (with Python's re)

    import re
    
    df_new = df.replace(re.compile('\.'), np.nan)
    
  2. Using a custom lambda function and np.nan

    import numpy as np
    
    df_new = df.treasury_rate.apply(lambda x: np.nan if x == '.' else x)
    
  3. Using pandas.DataFrame.replace and np.nan

    import numpy as np
    
    df_new = df.replace('.', np.nan)
    
    # or
    
    df_new = df.replace(to_replace=r'\.', value=np.nan, regex=True)
    

In all three cases the output is

            treasury_rate
1962-02-09           4.05
1962-02-10           4.05
1962-02-11           4.05
1962-02-12            NaN
1962-02-13           4.03
1962-02-14           4.03
1962-02-15           4.02
1962-02-16           4.02
1962-02-17           4.02
1962-02-18           4.00

Notes:

  • If one wants to do the operations on the same dataframe, one can use inplace=True such as

    df.replace(re.compile('\.'), np.nan)
    
  • Pandas has their own null value. So, instead of np.nan, one might want to use pd.NA. For more information on the difference between those two: pd.NA vs np.nan for pandas If one uses pd.NA, the output would be

               treasury_rate
    1962-02-09          4.05
    1962-02-10          4.05
    1962-02-11          4.05
    1962-02-12          <NA>
    1962-02-13          4.03
    1962-02-14          4.03
    1962-02-15          4.02
    1962-02-16          4.02
    1962-02-17          4.02
    1962-02-18           4.0
    
  • There are strong opinions on using .apply. If one wants to read some: When should I (not) want to use pandas apply() in my code?

Upvotes: 0

Nusrath
Nusrath

Reputation: 527

df.replace('\.','0', regex=True,inplace=True)

I think you have to give regex for '.' character to replace it.

Upvotes: 4

cullzie
cullzie

Reputation: 2755

I think you just need to add inplace=True if you want to update the df object inplace:

df.replace('.', np.nan, inplace=True)

Otherwise df.replace returns an new dataframe which you must assign to a variable to access:

new_df = df.replace('.', np.nan)

Upvotes: 2

Brandon Bertelsen
Brandon Bertelsen

Reputation: 44658

In this case creating a mask / subset and replacing the string literal would be better than attempting to use replace. You'll probably also want to convert your series to numeric you can also do it all in one step if you know your data is clean other than the "." character:

subs = df.treasury_rate == "."
df.loc[subs, 'treasure_rate'] = np.nan
# OR
df.treasury_rate = pd.to_numeric('treasury_rate', errors='coerce')

You could likely run the last line and have the dots replaced with NaN

Upvotes: 3

Related Questions