Categorizing different texts in Python

Question

I have a dataset in which each row is a specific compliance violation. The first column is the name of the violation (df['Violations']: Fire Exit, Aisle, Ergonomic Seats, and so on, up to 130 violations), the second column is the gravity of the violation (df['Category']: Minor, Medium, Major, Critical), and the third is the description of the violation (df['Description']: one or two sentences describing the issue).

Each violation (e.g. Aisle) can cover different issues (an aisle that is too small vs. an aisle that is obstructed). I want to classify my violations according to their descriptions. For example, I would like the following two violation descriptions to be classified under the same new category (obstruction):

'It is recommended that the factory should protect all aisles from any obstruction to ensure emergency evacuation and to ensure that all evacuation passages and emergency exits are clear at all times.'

and

"It is recommended that the factory should protect all aisles from any obstruction to ensure emergency evacuation and to ensure that all evacuation passages and emergency exits are clear at all times and provide proper fire safety training to workers conduct regular health & safety inspection"

I know there are particular keywords I could look for (e.g. obstruction), but it would take me quite a while to identify keywords for each violation category (I have more than 130 violation categories).

What kind of natural language processing can I run to have Python automatically identify different 'clusters' for the different categories? Any suggestions for doing this in Python?

EDIT:

I added a picture of the data.

Answers (1)
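
One possible approach, sketched here under assumptions rather than as a definitive solution: turn each description into a TF-IDF vector, then group descriptions whose cosine similarity is above a threshold. This avoids hand-picking keywords for each of the 130 violations. The sketch below uses only the standard library so it runs anywhere. The stop-word list, the 0.5 threshold, the greedy grouping rule, and the third "too narrow" description are all hypothetical choices made for illustration; only the first two descriptions come from the question.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "it", "that", "from",
              "any", "all", "at", "should", "be", "are", "in", "for", "too"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(docs, threshold=0.5):
    """Greedy grouping: each description joins the first cluster whose first
    member is at least `threshold` similar, otherwise it starts a new cluster."""
    vectors = tfidf_vectors(docs)
    representatives = []  # index of the first description in each cluster
    labels = []
    for i, vec in enumerate(vectors):
        for label, rep in enumerate(representatives):
            if cosine(vec, vectors[rep]) >= threshold:
                labels.append(label)
                break
        else:
            representatives.append(i)
            labels.append(len(representatives) - 1)
    return labels


descriptions = [
    # The two descriptions quoted in the question.
    "It is recommended that the factory should protect all aisles from any "
    "obstruction to ensure emergency evacuation and to ensure that all "
    "evacuation passages and emergency exits are clear at all times.",
    "It is recommended that the factory should protect all aisles from any "
    "obstruction to ensure emergency evacuation and to ensure that all "
    "evacuation passages and emergency exits are clear at all times and "
    "provide proper fire safety training to workers conduct regular health & "
    "safety inspection",
    # Hypothetical third description, invented for contrast.
    "The aisle between the sewing lines is too narrow and should be widened "
    "to meet the minimum width requirement.",
]

labels = cluster(descriptions)
print(labels)  # [0, 0, 1]
```

With the DataFrame from the question, the same function could be applied as `df['Cluster'] = cluster(df['Description'].tolist())`, ideally once per value of `df['Violations']` so that only descriptions of the same violation are compared. The cluster numbers carry no meaning on their own, so you would still need to read a few descriptions per cluster and name it (e.g. "obstruction"). For a dataset of realistic size, scikit-learn's `TfidfVectorizer` combined with one of its clustering estimators such as `KMeans` or `AgglomerativeClustering` would likely be a more robust way to do the same thing. The threshold (or the number of clusters) has to be tuned on your own data.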