Reputation: 13
I want to use a regex to replace all values in a dataframe column that start with "-99." with NaN,
as these are outliers.
I used df['Item'].replace(r(^[-][9][9]\d.*$),np.NaN)
but it did not work.
Upvotes: 1
Views: 1066
Reputation: 11002
The regular expression posted by @tripleee
is fine for detecting numbers (encoded as strings) starting with -99.
The problem here is that you are dealing with numbers, and regular expressions are only suited for strings.
Let's build a comprehensive example:
import numpy as np
import pandas as pd
df = pd.DataFrame([-999, -99.9, -9, 9, 99.9, 0., 1, -999], columns=['Item'])
Item
0 -999.0
1 -99.9
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
Then you can match the
outliers using the regular expression (provided the string format is suitable for it); all you need is to cast to string (astype
) before applying the regular expression (which lives in the str
accessor of Series):
q1 = df['Item'].astype(str).str.match(r'^-99\..*')
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
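As an aside, since the pattern here is really just a fixed prefix, a plain string method can do the same test without the regex engine. A minimal sketch on the same df:
# Literal prefix test on the stringified values; no pattern compilation,
# and it yields the same boolean mask as q1 for this data.
q1_alt = df['Item'].astype(str).str.startswith('-99.')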
But if you intend to replace those values with nan
using the replace function of the string accessor, it requires extra steps, as this replace
function expects another string and nothing else (passing np.nan
or None
will fail). You would then have to execute:
df['Item'].astype(str).str.replace(r'^-99\..*', 'nan', regex=True).astype(float)
IMO this is a pretty bad one-liner because of the "unnecessary" casting, which spoils the very nature of your data.
You are better off with logical indexing using the boolean vector above, either by replacing with a sentinel:
df.loc[q1, 'Item'] = np.nan
Item
0 -999.0
1 NaN
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
or slicing:
df = df.loc[~q1,:]
Item
0 -999.0
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
7 -999.0
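A third option in the same vein, if you prefer not to mutate df in place: Series.mask returns a copy with NaN wherever the condition holds. A small sketch reusing q1 from above:
# mask(cond) replaces values where cond is True (the default replacement
# is NaN) and returns a new Series, leaving the original frame untouched.
item_clean = df['Item'].mask(q1)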
Anyway, converting numbers into strings to detect outliers seems a bit odd (poor performance, complex behaviour that is hard to debug, an extra copy of the data).
If there is no reason for numbers less than -99.
to still be valid, then you can filter them out using a simple numerical criterion:
q2 = df['Item'] <= -99.
df = df.loc[~q2,:]
Item
2 -9.0
3 9.0
4 99.9
5 0.0
6 1.0
This will perform much better and avoids casting numbers to strings and back. It also avoids extra copies of the data (string, then float again, then overwriting the initial data). So it will be both memory efficient (no copy of the data) and computationally efficient (regular expressions are intensive) compared with your first choice.
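If you want to check the performance difference yourself, a quick timeit harness like the sketch below will do; the figures depend on your machine and data size, so none are quoted here, and the 100,000-row sample is an arbitrary choice:
import timeit

# Larger random sample so the timings are measurable
big = pd.DataFrame(np.random.uniform(-150, 150, size=100_000), columns=['Item'])

print(timeit.timeit(lambda: big['Item'].astype(str).str.match(r'^-99\..*'), number=10))
print(timeit.timeit(lambda: big['Item'] <= -99., number=10))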
If numbers below the cut-off must be kept, then you can still do it with float arithmetic. Just swap the less-than criterion for an epsilon-ball criterion around the desired value. To capture all numbers within [-100., -99.]
you can use the following setup:
target = -99.5
epsilon = 0.5
q3 = np.abs(df['Item'] - target) <= epsilon
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
Of course you can change the target
and make epsilon
as small as your machine precision allows.
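pandas also has a direct helper for this interval test: Series.between is inclusive on both ends by default, so it expresses the same [-100., -99.] window without spelling out the arithmetic. A sketch:
# Inclusive on both bounds by default; equivalent to the epsilon ball above
q3_alt = df['Item'].between(-100., -99.)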
Upvotes: 1
Reputation: 189377
Dunno about Pandas, but the code you show lacks quotes, and of course the regex doesn't do what you say you want to do. The trailing \d.*$
requires a digit right after the 99, followed by anything up to the end, so -99.9 (where a dot follows the 99) would not match. Probably you mean
df['Item'].replace(r'^-99\..*', np.nan, regex=True)
where the ^
anchor means beginning of line (or, here, beginning of the cell) and -99
just matches literal text. Finally \.
matches a literal dot, and .*
matches anything after that, up until the end of the cell.
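Two caveats for this to work end to end: replace only compiles the pattern when regex=True is passed, and the regex can only match string values, so the column needs a cast first (see the other answer). A sketch of the full pipeline:
# Cast to string so the regex can match, replace matches with NaN
# (regex=True makes replace compile the pattern), then cast back to float.
df['Item'] = df['Item'].astype(str).replace(r'^-99\..*', np.nan, regex=True).astype(float)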
Upvotes: 0