Reputation: 33
I have an Excel worksheet that I am reading into pandas for parsing and later analysis. It has the following format. All values are strings; they will be converted to floats/ints later, but keeping them as strings helps with parsing.
column1 | column2 | column3 |
-----------------------------
12345   | 10      | 20      |
txt     | 25      | 65      |
35615   | 15      | 20      |
txt     | 35      | 20      |
I need to get the index of every five-digit numeric value in column1. The numeric values will always be exactly five digits long. I am using the following regex.
\b\d{5}\b
I am having problems getting pandas to match the five digits with any of the built-in string methods.
I have tried the following.
df.column1.str.contains('\b\d{5}\b', regex=True).index.list()
df.column1.str.match('\b\d{5}\b').index.list()
I am expecting it to return
[0,2]
Both of these return an empty list. What am I doing wrong?
Upvotes: 3
Views: 1073
Reputation: 863226
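Add an r prefix before the string to make it a raw string (in a normal string literal, \b is the backspace character rather than a regex word boundary), filter by boolean indexing, and convert the index values to a list: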
i = df[df.column1.str.contains(r'\b\d{5}\b')].index.tolist()
print(i)
[0, 2]
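To see why the r prefix matters, here is a minimal sketch using only the standard re module. The variable names are just for illustration and are not part of the question's code.

```python
import re

plain = '\b'   # normal string literal: a single backspace character (\x08)
raw = r'\b'    # raw string literal: a backslash followed by the letter b

print(len(plain), len(raw))   # 1 2
print(plain == '\x08')        # True

# With a raw string the regex engine sees \b and treats it as a word boundary.
with_raw = re.search(r'\b\d{5}\b', '12345') is not None

# Without it the pattern starts and ends with a literal backspace character,
# which does not occur in the data, so nothing matches.
without_raw = re.search('\b' + r'\d{5}' + '\b', '12345') is not None

print(with_raw, without_raw)  # True False
```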
Or, if you want to match only values that consist of exactly five digits, anchor the regex with ^ and $ for the start and end of the string:
i = df[df.column1.str.contains(r'^\d{5}$')].index.tolist()
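For reference, a self-contained sketch that rebuilds the sample data from the question and compares the two patterns. The DataFrame construction and the extra Series s are assumptions made for illustration; the real data comes from an Excel file.

```python
import pandas as pd

# Sample data from the question, with every value kept as a string.
df = pd.DataFrame({
    'column1': ['12345', 'txt', '35615', 'txt'],
    'column2': ['10', '25', '15', '35'],
    'column3': ['20', '65', '20', '20'],
})

# Boolean mask: True where column1 is exactly five digits.
mask = df['column1'].str.contains(r'^\d{5}$')
idx = df[mask].index.tolist()
print(idx)  # [0, 2]

# Hypothetical values showing where the two patterns differ.
s = pd.Series(['12345', 'id 12345', '123456'])
boundary = s.str.contains(r'\b\d{5}\b').tolist()
anchored = s.str.contains(r'^\d{5}$').tolist()
print(boundary)  # [True, True, False]
print(anchored)  # [True, False, False]
```

The \b version also accepts a five-digit number embedded in longer text, while the anchored version requires the whole cell to be five digits. If column1 can contain missing values, passing na=False to str.contains should keep the mask purely boolean.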
Upvotes: 3