Identify abbreviations in a string column

Question

Given the following data frame, for instance (note that the original column has dtype('O'), i.e. object):

df = pd.DataFrame({'product_description': ["CUTLERY HVY DUTY FORKS", "XYZ DISP LQD SOAP", "ABCD FOOD STRG CNTNR"]})

How can I effectively identify and separate the abbreviations and produce a result like this?

   product_description      abbreviations
0  CUTLERY HVY DUTY FORKS   [HVY]
1  XYZ DISP LQD SOAP        [XYZ, DISP, LQD]
2  ABCD FOOD STRG CNTNR     [ABCD, STRG, CNTNR]

The goal is to then convert the abbreviations into full words.

I have tried this:

import pandas as pd
import re

df = pd.DataFrame({'product_description': ["CUTLERY HVY DUTY FORKS", "XYZ DISP LQD SOAP", "ABCD FOOD STRG CNTNR"]})

def extract_abbreviations(description):
    abbreviation_pattern = r'\b[A-Z]{2,}(?![a-z])'  # Updated regular expression pattern to match abbreviations
    abbreviations = re.findall(abbreviation_pattern, description)
    return abbreviations

df['abbreviations'] = df['product_description'].apply(extract_abbreviations)
print(df)

But this is what I get:

   product_description      abbreviations
0  CUTLERY HVY DUTY FORKS   [CUTLERY, HVY, DUTY, FORKS]
1  XYZ DISP LQD SOAP        [XYZ, DISP, LQD, SOAP]
2  ABCD FOOD STRG CNTNR     [ABCD, FOOD, STRG, CNTNR]

Your help is much appreciated. Thank you
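
For context on the result above: the pattern `\b[A-Z]{2,}(?![a-z])` matches any run of two or more uppercase letters, and every word in these descriptions is fully uppercase, so every word matches. A regex on letter case alone probably cannot tell `HVY` from `DUTY`. One possible direction, sketched below, is to compare each token against a list of known full words and treat everything else as an abbreviation. The `KNOWN_WORDS` set here is a hypothetical, hand-written stand-in; a real word list (a dictionary file or a domain vocabulary) would have to be supplied.

```python
import pandas as pd

# Hypothetical vocabulary of full words, written by hand for this example.
# In practice this would come from a dictionary file or a domain word list.
KNOWN_WORDS = {"CUTLERY", "DUTY", "FORKS", "SOAP", "FOOD"}

def extract_abbreviations(description, known_words=KNOWN_WORDS):
    # Split on whitespace and keep the tokens that are not known full words.
    return [token for token in description.split() if token not in known_words]

df = pd.DataFrame({'product_description': ["CUTLERY HVY DUTY FORKS",
                                           "XYZ DISP LQD SOAP",
                                           "ABCD FOOD STRG CNTNR"]})

df['abbreviations'] = df['product_description'].apply(extract_abbreviations)
print(df)
```

With this particular word set, the three rows yield `['HVY']`, `['XYZ', 'DISP', 'LQD']` and `['ABCD', 'STRG', 'CNTNR']`. The quality of the result depends entirely on the vocabulary: a brand name such as `XYZ` is flagged only because it is absent from the list, and a short real word missing from the list would be flagged as well.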
