niha1234

Reputation: 13

Finding a specific digit pattern with regex in Python

I want to use a regex to replace all values in a DataFrame column that start with "-99." with NaN, as these are outliers.

I used df['Item'].replace(r(^[-][9][9]\d.*$),np.NaN) but it did not work.

Upvotes: 1

Views: 1066

Answers (2)

jlandercy

Reputation: 11002

TL;DR

The regular expression posted by @tripleee is fine for detecting numbers (encoded as strings) starting with -99. The problem here is that you are dealing with numbers, and regular expressions are only suited for strings.
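To illustrate the pitfall, here is a minimal sketch (using replace with regex=True, which only attempts matches against string values): on a float Series the pattern silently matches nothing, while it does match once the values are cast to strings.

import numpy as np
import pandas as pd

s = pd.Series([-99.9, 9.0])

# Float dtype: the regex has nothing to match against, values stay unchanged:
s.replace(r'^-99\..*', np.nan, regex=True)               # [-99.9, 9.0]

# String dtype: the pattern now matches the textual representation:
s.astype(str).replace(r'^-99\..*', np.nan, regex=True)   # [NaN, '9.0']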

MCVE

Let's build a comprehensive example:

import numpy as np
import pandas as pd

df = pd.DataFrame([-999, -99.9, -9, 9, 99.9, 0., 1, -999], columns=['Item'])

    Item
0 -999.0
1  -99.9
2   -9.0
3    9.0
4   99.9
5    0.0
6    1.0
7 -999.0

Regular Expression

You can then match the outliers using the regular expression (provided the string representation is suitable for it); all you need is to cast the column to string (astype) before applying the regular expression (which lives in the str accessor of Series).

q1 = df['Item'].astype(str).str.match(r'^-99\..*')

0    False
1     True
2    False
3    False
4    False
5    False
6    False
7    False

But if you intend to replace those values with nan using the replace function of the string accessor, it requires extra steps, because this replace function expects another string and nothing else (using np.nan or None will fail). You would then have to execute:

df['Item'].astype(str).str.replace(r'^-99\..*', 'nan', regex=True).astype(float)

IMO this is a pretty bad one-liner because of the "unnecessary" casting, which spoils the very nature of your data.

Logical Indexing

You are better off using logical indexing with the boolean vector above, either by replacing matches with a sentinel:

df.loc[q1] = np.nan

    Item
0 -999.0
1    NaN
2   -9.0
3    9.0
4   99.9
5    0.0
6    1.0
7 -999.0

or slicing:

df = df.loc[~q1,:]

    Item
0 -999.0
2   -9.0
3    9.0
4   99.9
5    0.0
6    1.0
7 -999.0
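One caveat on the sentinel variant above: df.loc[q1] = np.nan overwrites every column of the matching rows, which makes no difference on this single-column frame but would on a wider one. A sketch of two column-scoped equivalents (assuming only Item should be touched):

# Target the column explicitly:
df.loc[q1, 'Item'] = np.nan

# Or, without assigning through loc, use Series.mask
# (replaces values where the condition is True with NaN by default):
df['Item'] = df['Item'].mask(q1)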

Anyway, converting numbers into strings to detect outliers seems a bit odd (poor performance, complex behaviour that is hard to debug, an extra copy of the data).

Float Arithmetic

Simple filter

If there is no reason for numbers less than -99. to still be valid, you can filter them out using a simple numerical criterion:

q2 = df['Item'] <= -99.
df = df.loc[~q2,:]

   Item
2  -9.0
3   9.0
4  99.9
5   0.0
6   1.0

This will perform far better and avoids casting numbers to strings and back. It also avoids extra copies of the data (string, then float again, then overwriting the initial data). So it is both memory efficient (no extra copy) and computationally efficient (regular expressions are intensive) compared with your first approach.
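If you want to check that claim on your own data, here is a rough timing sketch with timeit (exact numbers will depend on your machine and pandas version):

import timeit

import numpy as np
import pandas as pd

# A larger frame so the difference is visible:
big = pd.DataFrame({'Item': np.random.uniform(-200., 200., size=100_000)})

t_regex = timeit.timeit(lambda: big['Item'].astype(str).str.match(r'^-99\..*'), number=10)
t_float = timeit.timeit(lambda: big['Item'] <= -99., number=10)

print(f"regex: {t_regex:.3f} s, float: {t_float:.3f} s")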

Epsilon ball filter

If numbers less than the cut-off must be kept, you can still do it with float arithmetic. Just swap the less-than criterion for an epsilon-ball criterion around the desired value. To capture all numbers within [-100., -99.] you can use the following setup:

target = -99.5
epsilon = 0.5
q3 = np.abs(df['Item'] - target) <= epsilon

0    False
1     True
2    False
3    False
4    False
5    False
6    False
7    False

Of course you can change the target and make epsilon as small as needed with regard to your machine precision.
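As a side note, the same epsilon-ball criterion can be written with np.isclose, pinning rtol to zero so only the absolute tolerance applies:

# |x - target| <= atol, since rtol is forced to zero:
q3 = np.isclose(df['Item'], -99.5, rtol=0., atol=0.5)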

Upvotes: 1

tripleee

Reputation: 189377

Dunno about Pandas, but the code you show lacks quotes, and of course the regex doesn't do what you say you want: the \d.*$ tail requires a digit immediately after -99, so a value like -99.5, where a literal dot follows, never matches. Probably you mean

df['Item'].replace(r'^-99\..*',np.NaN)

where the ^ anchor means beginning of line (or, here, beginning of the cell) and -99 just matches literal text. Finally \. matches a literal dot, and .* matches anything after that, up until the end of the cell.
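A quick sketch of the corrected pattern with Python's own re module, independent of Pandas:

import re

pattern = re.compile(r'^-99\..*')
for cell in ['-99.9', '-99.', '-999', '9.0']:
    print(cell, bool(pattern.match(cell)))
# -99.9 True, -99. True, -999 False, 9.0 False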

Upvotes: 0
