iRoygbiv
iRoygbiv

Reputation: 875

Classifying text based information (i.e. implementing a string kernel)

I need to run a clustering algorithm on a set of consumer data but I am not sure how to handle text based fields (perfect example being alpha numeric post codes, such as SE1 8XR).

Apparently what I need to use is a string kernel, which I understand the basic idea of, but not well enough to implement successfully.

Ideally I would like the new numeric vector to encode the idea that the more dissimilar two post codes are the further the data points are from each other, but have no idea how to do this and I've not been able to find a single useful tutorial, guide or textbook!

Also I am doing this in Python, in case there is some library anyone knows of that may be useful.

Upvotes: 0

Views: 371

Answers (1)

Jon Clements
Jon Clements

Reputation: 142216

re postcodes

You can't compare postcodes as strings. `AL1 1AA' is St. Albans, and 'AB1 1AA' is Aberdeen. They're remarkably close in edit distance, but CR6 7DX is closer to St. Albans :)

Your best approach is probably to grab some lookup table (I know you can buy them off Royalmail) perhaps from http://www.ordnancesurvey.co.uk/oswebsite/products/os-opendata.html which takes a postcode, or at least a sector 'AL1 1' (maybe even district 'AL1') for instance, and maps that to lat/lon, which you then use to bucket data.

other strings

One possible option would be to use difflib.SequenceMatcher which returns a percentage score of how similar two strings are to each other (there are plenty of other algorithms out there: google "genetic string algorithms", "fuzzy string matching", "string similarity algorithms" etc...). Group all strings that are (maybe) 80% similar to each other, and assign those a group - then cluster on that group.

You may also find metaphone & double metaphone (and possibly ntlk) useful depending on the complexity of your requirements and data.

Upvotes: 2

Related Questions