Marek Ševela
Marek Ševela

Reputation: 107

Performing multiple regex match on pandas dataframe column

I have an extraction of email body strings loaded in a dataframe under df['Body'] All I need is to perform matches with regex codes, when one of the regex matches, it should assign a category based on what regex received a match.

For that I am trying to:

  1. make a for loop going through df['Body']
  2. check the iterated string and execute regex matches
  3. based on matched regex, print into label column "regex1 matched" or "regex2 matched" etc.

I started with a simple for loop below but cant get it to work even with only one regex check and console print.

df = pd.read_excel('Body_List.xlsx', header=None, index=False, names=['Body','Label'])

regex = re.compile(r"(?s)^(?=.*\b(?:[Uu]nblock|[Ss]onderfreigabe|[Rr]elease|[Ff]reigeben|[Ff]reigabe|[Ff]reischalten|[Oo][Rr][Dd])\b).*\b([47]\d{6}|\d{7})\b")


for message in (df['Body']):
    if re.match(regex, message)== True:                   
  
        print("regex matched " + df['Label'])
    else:

        print("regex didnt matched" + df['Label']) 

What I receive after execution:

enter image description here

Could anyone please help how to make this loop work + have this implemented for multiple regex checks?

Thank you very much!

Examples of other regex checks:

release_test = re.compile('\b((4|7)\d{6})\b')
account_test = re.compile('\b((1)\d{6})\b|(\account|Account|Konto|konto|Tili|tili|účet|ucet|Compte|compte)')

Upvotes: 0

Views: 2226

Answers (1)

Marek Ševela
Marek Ševela

Reputation: 107

After using code snippet from @WiktorStribiżew as an inspiration, I was able to create 2 columns with regex checks inside of my pandas dataframe with use of np.where. This will now enable me to categorize df['Body'] contents based on each regex match.

Thank you!

Code solution:

import pandas as pd
import numpy as np
import xlsxwriter
import re


df = pd.read_excel('Regex testing 2.xlsx', header=None, index=False, names=['Sender','Subject', 'Body','Date','Time','Category', 'Folder','Regex1','Regex2'])


regex1 = re.compile(r"(?si)^(?=.*\b(?:unblock|kredit|sonderfreigabe|unlock|release|freigeben|freigabe|freischalten|ord)\b).*\b([47]\d{6}|\d{8})\b")

regex2 = re.compile(r"\b((1)\d{6})\b|(\account|Account|Konto|konto|Tili|tili|účet|ucet|Compte|compte)")


df['Regex1'] = np.where(df.Body.str.match(regex1), "Body matched regex1", "Body not matching regex1")

df['Regex2'] = np.where(df.Body.str.match(regex2), "Body matched regex1", "Body not matching regex1")


# export to excel
writer = pd.ExcelWriter(r'RegexOutput.xlsx', engine='xlsxwriter', options={'strings_to_urls': False})
df.to_excel(writer, index=False)
writer.close()

Upvotes: 2

Related Questions