Reputation: 4557
Working with a text dataset, I have an extraction that gives me irregular results in a dataframe. I am not very good with regular expressions and have never done a filter trying one so help would be appreciated.
I am trying to filter column a for rows 4 & 6. The pattern is 4 numbers, a letter, a space, / space, 2 numbers, space, /, space, 5 numbers, space, /, then whatever follows.
The dataframe looks like this:
a b c d
0 1234B:Program Name / Title Chapter Page Number ID Code
1 1234B:Program Name / Title Chapter Page Number ID Code
2 1234B:Program Name / Title Chapter Page Number ID Code
3 1234B / 01 / 2 (blank) (blank) ID Code
4 1234B / 01 / 23456 / Title Chapter Page Number ID Code <---- Filter for this
5 1234B / 01 / 2 (blank) (blank) ID Code
6 1234B / 01 / 23456 / Title Chapter Page Number ID Code <---- Filter for this
I've tried the following code:
# Filter by pattern
import pandas as pd
import numpy as np
import re
pattern = re.compile("[0-9][0-9][0-9][0-9][B][\s][/][\s][0-9][0-9][\s][/][\s][0-9][0-9][0-9][0-9][0-9][\s]+[/]")
df = df[df['a'].apply(pattern)]
Result is a TypeError: '_sre.SRE_Pattern' object is not callable. It looks like I'm applying it wrong. Also my regular expression does not have a wildcard to account for the rest of the data in column a. What is a pythonic way to filter column A to look at the first 20 characters in column A and do a pattern match on it?
Upvotes: 3
Views: 2321
Reputation: 10359
You can use the following, based on your rules given:
df = df[df['a'].str.match(r'\d{4}[a-zA-z]\s\/\s\d{2}\s\/\s\d{5}\s\/.*')]
this gives:
a b c d
4 1234B / 01 / 23456 / Title Chapter Page Number ID Code
6 1234B / 01 / 23456 / Title Chapter Page Number ID Code
Upvotes: 5