Reputation: 107
I have an extraction of email body strings loaded in a dataframe under df['Body']
All I need is to perform matches with regex codes, when one of the regex matches, it should assign a category based on what regex received a match.
For that I am trying to:
df['Body']
I started with a simple for loop below but cant get it to work even with only one regex check and console print.
df = pd.read_excel('Body_List.xlsx', header=None, index=False, names=['Body','Label'])
regex = re.compile(r"(?s)^(?=.*\b(?:[Uu]nblock|[Ss]onderfreigabe|[Rr]elease|[Ff]reigeben|[Ff]reigabe|[Ff]reischalten|[Oo][Rr][Dd])\b).*\b([47]\d{6}|\d{7})\b")
for message in (df['Body']):
if re.match(regex, message)== True:
print("regex matched " + df['Label'])
else:
print("regex didnt matched" + df['Label'])
What I receive after execution:
Could anyone please help how to make this loop work + have this implemented for multiple regex checks?
Thank you very much!
Examples of other regex checks:
release_test = re.compile('\b((4|7)\d{6})\b')
account_test = re.compile('\b((1)\d{6})\b|(\account|Account|Konto|konto|Tili|tili|účet|ucet|Compte|compte)')
Upvotes: 0
Views: 2226
Reputation: 107
After using code snippet from @WiktorStribiżew as an inspiration, I was able to create 2 columns with regex checks inside of my pandas dataframe with use of np.where
. This will now enable me to categorize df['Body']
contents based on each regex match.
Thank you!
Code solution:
import pandas as pd
import numpy as np
import xlsxwriter
import re
df = pd.read_excel('Regex testing 2.xlsx', header=None, index=False, names=['Sender','Subject', 'Body','Date','Time','Category', 'Folder','Regex1','Regex2'])
regex1 = re.compile(r"(?si)^(?=.*\b(?:unblock|kredit|sonderfreigabe|unlock|release|freigeben|freigabe|freischalten|ord)\b).*\b([47]\d{6}|\d{8})\b")
regex2 = re.compile(r"\b((1)\d{6})\b|(\account|Account|Konto|konto|Tili|tili|účet|ucet|Compte|compte)")
df['Regex1'] = np.where(df.Body.str.match(regex1), "Body matched regex1", "Body not matching regex1")
df['Regex2'] = np.where(df.Body.str.match(regex2), "Body matched regex1", "Body not matching regex1")
# export to excel
writer = pd.ExcelWriter(r'RegexOutput.xlsx', engine='xlsxwriter', options={'strings_to_urls': False})
df.to_excel(writer, index=False)
writer.close()
Upvotes: 2