Rusty

Reputation: 33

Get index of regex match in pandas dataframe not working

I have an Excel worksheet that I am reading into pandas for parsing and later analysis. It has the following format. All values are strings; they will be converted to floats/ints later, but keeping them as strings helps with parsing.

column1 | column2 | column3 |
-----------------------------
12345   | 10      | 20      |
txt     | 25      | 65      |
35615   | 15      | 20      |
txt     | 35      | 20      |

I need to get the index of every five-digit numeric value in column1. These values always have exactly five digits. I am using the following regex.

\b\d{5}\b

I am having problems getting pandas to properly match the five digits when using any of the built-in string methods.

I have tried the following.

df.column1.str.contains('\b\d{5}\b', regex=True).index.list()
df.column1.str.match('\b\d{5}\b').index.list()

I am expecting it to return

[0,2]

Both of these return an empty list. What am I doing wrong?

Upvotes: 3

Views: 1073

Answers (1)

jezrael

Reputation: 863226

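Add the r prefix so the pattern is a raw string (in a normal Python string literal, '\b' is a backspace character rather than a regex word boundary), filter with boolean indexing, and convert the resulting index to a list: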

i = df[df.column1.str.contains(r'\b\d{5}\b')].index.tolist()
print(i)
[0, 2]
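
To see why the r prefix matters, here is a minimal sketch using only the standard re module. The variable names non_raw and raw are made up for illustration; non_raw is built from pieces so that it contains the same characters the regex engine receives when the pattern is written without the prefix.

```python
import re

# In a normal string literal, '\b' is the single backspace character (ASCII 8).
print('\b' == '\x08')          # True
print(len('\b'), len(r'\b'))   # 1 2

# Hypothetical names: what the engine sees without and with the r prefix.
non_raw = '\b' + r'\d{5}' + '\b'   # backspace, \d{5}, backspace
raw = r'\b\d{5}\b'                 # word boundary, \d{5}, word boundary

print(re.search(non_raw, '12345'))       # None: '12345' contains no backspace
print(re.search(raw, '12345').group())   # 12345
```

Without the prefix, the pattern looks for five digits surrounded by literal backspace characters, which never occur in the data, so nothing matches.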

Or, if you want to match only values that consist of exactly five digits, anchor the regex with ^ and $ for the start and end of the string:

i = df[df.column1.str.contains(r'^\d{5}$')].index.tolist()
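
As a rough, self-contained sketch of how the two patterns differ, the following rebuilds the sample table from the question and appends one extra row, 'id 12345'. That row is hypothetical and not part of the original data; it is there only to show a value that contains a five-digit word without consisting solely of it.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row (index 4)
# added only to show where the two patterns disagree.
df = pd.DataFrame({
    'column1': ['12345', 'txt', '35615', 'txt', 'id 12345'],
    'column2': ['10', '25', '15', '35', '40'],
    'column3': ['20', '65', '20', '20', '50'],
})

# Word boundaries: any cell containing a standalone run of five digits.
word_boundary = df[df.column1.str.contains(r'\b\d{5}\b')].index.tolist()

# Anchors: only cells made up of exactly five digits.
anchored = df[df.column1.str.contains(r'^\d{5}$')].index.tolist()

print(word_boundary)  # [0, 2, 4]
print(anchored)       # [0, 2]
```

If the column can hold mixed text, the anchored pattern is the stricter choice; if every numeric cell holds nothing but the number, both patterns return the same indices.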

Upvotes: 3
