kfp_ny

Reputation: 39

Making string column consistent/clean in pandas

I'm working with a dataset that has "unclean" string columns. These are company names, most of which were entered manually, so there are typos and inconsistent forms of the same name. The column looks something like this:

company_name
big compnay
big company
big company inc.
smll compny
small company
small inc.

I am trying to clean the column above so that it looks like this:

company_name
big company
big company
big company
small company
small company
small company

The dataset is far too large to clean manually. I would really appreciate any suggestions, help, or advice. I've tried modules such as fuzzywuzzy, but I couldn't figure out the best way to deal with the problem above.

Thanks.

Upvotes: 0

Views: 698

Answers (1)

Sharan N

Reputation: 648

You can use a probabilistic spell corrector: replace each word with a word that is within an edit distance of one or two and that occurs much more frequently in your dataset. A Python implementation is provided here: http://norvig.com/spell-correct.html
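As a rough sketch of that idea (adapted from the approach in the linked article, not a copy of it), the word frequencies can be taken from the column itself. The names `correct_word`, `clean_name`, and the `min_ratio` threshold are made up for illustration, and the code assumes lowercase names with words separated by spaces and typos limited to the letters a-z.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def correct_word(word, counts, min_ratio=2):
    """Return a clearly more frequent word within edit distance 1 or 2.

    min_ratio is an assumed threshold: a candidate must occur at least
    min_ratio times as often as the original word to replace it.
    """
    threshold = min_ratio * max(counts[word], 1)
    for candidates in (edits1(word), edits2(word)):
        known = [w for w in candidates if w in counts and counts[w] >= threshold]
        if known:
            # Most frequent candidate wins; ties are broken alphabetically.
            return max(known, key=lambda w: (counts[w], w))
    return word

# Sample data from the question.
names = [
    "big compnay",
    "big company",
    "big company inc.",
    "smll compny",
    "small company",
    "small inc.",
]

# Word frequencies come from the dataset itself.
counts = Counter(word for name in names for word in name.split())

# Correct each distinct word once, then reuse the mapping for every row.
mapping = {word: correct_word(word, counts) for word in counts}

def clean_name(name):
    return " ".join(mapping.get(word, word) for word in name.split())

cleaned = [clean_name(name) for name in names]
print(cleaned)
```

With a DataFrame, `clean_name` could be applied to the column with `Series.map`. Note that this only handles typos: variants such as a trailing "inc." or "small inc." standing for "small company" are untouched and would need a separate normalization step.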

Upvotes: 1
