Reputation: 45
Given the following data frame, for instance (note that the original data for this column has dtype('O'), i.e. object):
df = pd.DataFrame({'product_description': ["CUTLERY HVY DUTY FORKS", "XYZ DISP LQD SOAP", "ABCD FOOD STRG CNTNR"]})
How can I effectively identify and separate the abbreviations and produce a result like this:
   product_description     abbreviations
0  CUTLERY HVY DUTY FORKS  [HVY]
1  XYZ DISP LQD SOAP       [XYZ,DISP,LQD]
2  ABCD FOOD STRG CNTNR    [ABCD,STRG,CNTNR]
so that I can then convert the abbreviations into full words?
I have tried this:
import pandas as pd
import re
df = pd.DataFrame({'product_description': ["CUTLERY HVY DUTY FORKS", "XYZ DISP LQD SOAP", "ABCD FOOD STRG CNTNR"]})
def extract_abbreviations(description):
    abbreviation_pattern = r'\b[A-Z]{2,}(?![a-z])' # Updated regular expression pattern to match abbreviations
    abbreviations = re.findall(abbreviation_pattern, description)
    return abbreviations
df['abbreviations'] = df['product_description'].apply(extract_abbreviations)
print(df)
But this is what I get:
   product_description     abbreviations
0  CUTLERY HVY DUTY FORKS  [CUTLERY,HVY,DUTY,FORKS]
1  XYZ DISP LQD SOAP       [XYZ,DISP,LQD,SOAP]
2  ABCD FOOD STRG CNTNR    [ABCD,FOOD,STRG,CNTNR]
Your help is much appreciated. Thank you
Upvotes: 0
Views: 99
Reputation: 8273
Given that you have a list of known abbreviations, e.g. ['XYZ', 'DISP', 'LQD', 'ABCD', 'STRG', 'CNTNR', 'HVY'],
you should be able to apply the logic below to obtain the desired result:
abb = ['XYZ', 'DISP', 'LQD', 'ABCD', 'STRG', 'CNTNR', 'HVY']
abb_set = set(abb)  # a set gives fast membership tests
def return_abb(row):
    # keep the words that are known abbreviations, in their original order
    return [word for word in row.split() if word in abb_set]
df['abbreviations'] = df['product_description'].apply(return_abb)
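For reference, here is a minimal self-contained sketch of the lookup idea, extended to the stated goal of turning abbreviations into full words. The names `known_abbreviations`, `expansions`, `find_abbreviations` and `expand` are hypothetical, and the full-word values in `expansions` are guesses for illustration only; you would need to supply the real mapping for your data.

```python
import pandas as pd

df = pd.DataFrame({'product_description': ["CUTLERY HVY DUTY FORKS",
                                           "XYZ DISP LQD SOAP",
                                           "ABCD FOOD STRG CNTNR"]})

# Assumption: the abbreviations are known in advance and maintained by hand.
known_abbreviations = {'XYZ', 'DISP', 'LQD', 'ABCD', 'STRG', 'CNTNR', 'HVY'}

# Hypothetical expansions (illustrative guesses, not taken from the question).
# Abbreviations without an entry here (XYZ, ABCD) are left unchanged.
expansions = {
    'HVY': 'HEAVY',
    'DISP': 'DISPENSER',
    'LQD': 'LIQUID',
    'STRG': 'STORAGE',
    'CNTNR': 'CONTAINER',
}

def find_abbreviations(description):
    # Keep only the words that are known abbreviations, preserving their order.
    return [word for word in description.split() if word in known_abbreviations]

def expand(description):
    # Replace each word that has a known expansion; keep every other word as-is.
    return " ".join(expansions.get(word, word) for word in description.split())

df['abbreviations'] = df['product_description'].apply(find_abbreviations)
df['expanded'] = df['product_description'].apply(expand)
print(df)
```

A list comprehension is used rather than a set intersection so that the abbreviations come out in the same order as they appear in the description.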
Upvotes: 0