Reputation: 1112
I have a dataset where each row is a specific compliance violation. The first column is the name of the violation (df['Violations']: Fire Exit, Aisle, Ergonomic Seats, ... up to 130 violations), the second column is the severity of the violation (df['Category']: Minor, Medium, Major, Critical), and the third is the description of the violation (df['Description']: 1-2 sentences describing the issue).
Each violation (e.g. Aisle) can cover different issues (an aisle that is too small vs. an aisle that is obstructed). I want to classify my violations according to the violation description. For example, I would like the following two violation descriptions to be classified under the same new category (obstruction):
'It is recommended that the factory should protect all aisles from any obstruction to ensure emergency evacuation and to ensure that all evacuation passages and emergency exits are clear at all times.'
and
"It is recommended that the factory should protect all aisles from any obstruction to ensure emergency evacuation and to ensure that all evacuation passages and emergency exits are clear at all times and provide proper fire safety training to workers conduct regular health & safety inspection"
I know there are particular keywords I could look for (e.g. obstruction), but it would take me quite a bit to identify keywords for each violation category (I have more than 130 violation categories).
What kind of natural language processing can I run to have Python automatically identify different 'clusters' for the different categories? Any suggestions for Python?
EDIT:
I added a pic of the data
Upvotes: 0
Views: 85
Reputation: 396
"it would take me quite a bit to identify keywords for each violation category"
This is a topic modeling task, and one way to approach it is Latent Dirichlet Allocation (LDA), which groups text by the topics it discovers without you having to supply keywords. LDA treats each document as a mixture of topics in certain proportions, and each topic as a mixture of keywords, again in certain proportions. Assigning each description to its highest-proportion topic then gives you the clusters.
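To make the "documents are mixtures of topics, topics are mixtures of keywords" idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written in plain Python. Everything in it is an assumption for illustration: the names `tokenize` and `fit_lda`, the tiny stop-word list, the hyperparameters `alpha` and `beta`, and the four sample descriptions (the first two are shortened from the question, the last two are made up). It is not a tuned implementation, and with so few documents the topics it finds may not be meaningful.

```python
import random
import re

# Hypothetical, deliberately tiny stop-word list for this sketch.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "to", "of", "in", "by",
              "all", "from", "any", "that", "it", "be", "should", "for",
              "not", "does", "were", "too"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (theta, top_words): theta[d][k] is the proportion of topic k
    in document d, and top_words[k] lists the most frequent words of topic k.
    """
    rng = random.Random(seed)
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]           # topic counts per doc
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(tokens):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    theta = [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
             for row in doc_topic]
    top_words = [[vocab[i] for i in
                  sorted(range(n_words), key=lambda i: -row[i])[:5]]
                 for row in topic_word]
    return theta, top_words

# Sample descriptions: the last two are invented for illustration.
descriptions = [
    "The factory should protect all aisles from any obstruction to ensure emergency evacuation.",
    "Evacuation passages and emergency exits were blocked by an obstruction.",
    "The aisle is too narrow and does not meet the minimum width.",
    "Aisle width is too narrow for the minimum required width.",
]

theta, top_words = fit_lda(descriptions, n_topics=2)
# Use each description's dominant topic as its cluster label.
labels = [max(range(len(row)), key=lambda k: row[k]) for row in theta]
```

On a real DataFrame you would pass something like `df['Description'].tolist()` as `docs` and store `labels` in a new column. For 130+ violation types and real data volumes, a library implementation such as scikit-learn's `LatentDirichletAllocation` or gensim's `LdaModel` is likely a better fit than this sketch; note that those use variational inference rather than Gibbs sampling.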
Since you haven't shared the dataset, I would point you to this excellent resource. You can also get visualizations such as these.
Upvotes: 1