Python 3 Pandas filter a dataframe by regular expression and wildcard

Question

Working with a text dataset, I have an extraction that gives me irregular results in a dataframe. I am not very good with regular expressions and have never done a filter trying one so help would be appreciated.

I am trying to filter column a for rows 4 & 6. The pattern is 4 numbers, a letter, a space, / space, 2 numbers, space, /, space, 5 numbers, space, /, then whatever follows.

The dataframe looks like this:

    a                            b          c            d
0   1234B:Program Name / Title   Chapter    Page Number  ID Code
1   1234B:Program Name / Title   Chapter    Page Number  ID Code
2   1234B:Program Name / Title   Chapter    Page Number  ID Code
3   1234B / 01 / 2               (blank)    (blank)      ID Code
4   1234B / 01 / 23456 / Title   Chapter    Page Number  ID Code   <---- Filter for this
5   1234B / 01 / 2               (blank)    (blank)      ID Code
6   1234B / 01 / 23456 / Title   Chapter    Page Number  ID Code   <---- Filter for this

I've tried the following code:

# Filter by pattern
import pandas as pd
import numpy as np
import re
pattern = re.compile("[0-9][0-9][0-9][0-9][B][\s][/][\s][0-9][0-9][\s][/][\s][0-9][0-9][0-9][0-9][0-9][\s]+[/]")
df = df[df['a'].apply(pattern)]

Result is a TypeError: '_sre.SRE_Pattern' object is not callable. It looks like I'm applying it wrong. Also my regular expression does not have a wildcard to account for the rest of the data in column a. What is a pythonic way to filter column A to look at the first 20 characters in column A and do a pattern match on it?

asongtoruin · Accepted Answer

You can use the following, based on your rules given:

df = df[df['a'].str.match(r'\d{4}[a-zA-z]\s\/\s\d{2}\s\/\s\d{5}\s\/.*')]

this gives:

                            a        b            c        d
4  1234B / 01 / 23456 / Title  Chapter  Page Number  ID Code
6  1234B / 01 / 23456 / Title  Chapter  Page Number  ID Code

Python 3 Pandas filter a dataframe by regular expression and wildcard

Answers (1)

Related Questions