Reputation: 794
I am a complete beginner to machine learning so excuse me for the general question.
I'm trying to map column names from arbitrary datasets to the columns of a known dataset. For example, the known column 'image_link' might appear as 'Image Link' in one dataset and as 'img_url' in another.
I have a large dataset of the different variations for each column name.
I believe machine learning could help with this and have started looking into it. I have done some machine learning with Python, mainly linear regression, which I feel doesn't suit this problem.
I have done quite a lot of research via Google to find examples of something similar, but I haven't found much. Can anyone advise whether this is even something I should be solving with machine learning and, if so, whether there are any particular machine learning techniques that might fit this problem, so I know what direction to take my research in?
Any help would be appreciated.
**EDIT**
After a bit more research, I feel a classifier is the way to go, maybe using an SVM or Naive Bayes?
I have also created a very basic dataset, but what would be the best way to prepare this kind of data for processing?
| Category | Term |
|---|---|
| id | SKU |
| id | id |
| id | productID |
| link | productLink |
| link | URL |
| link | link |
| image_large | Image |
| image_large | ImageMedium |
| image_large | image_link |
| image_thumb | ImageSmall |
| image_thumb | Image |
| image_thumb | image link |
Upvotes: 4
Views: 1514
Reputation: 23
If you have (or can create) a training set that maps many examples of these 'wild' field names to the standard field names you want, you could also implement a machine learning solution (supervised multi-class text classification). In your case the 'wild' field name is the predictor (the input feature) and the standard field name is the target label you are trying to predict.
Here is a simple implementation in Python/scikit-learn, but just search for "supervised multi-class text classification" and I am sure you will find lots of tutorials and explanations that will help.
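The linked implementation is not reproduced in this thread, so as a rough illustration of the idea, here is a dependency-free sketch of a multi-class text classifier: a multinomial Naive Bayes over character n-grams, trained on the Category/Term pairs from the question. The names `char_ngrams` and `NgramNaiveBayes` are hypothetical, and the n-gram range and separator handling are assumptions. In scikit-learn, roughly the same thing could be built from `TfidfVectorizer(analyzer="char_wb")` and `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Toy training data taken from the table in the question: (target, wild term).
TRAINING = [
    ("id", "SKU"), ("id", "id"), ("id", "productID"),
    ("link", "productLink"), ("link", "URL"), ("link", "link"),
    ("image_large", "Image"), ("image_large", "ImageMedium"),
    ("image_large", "image_link"),
    ("image_thumb", "ImageSmall"), ("image_thumb", "Image"),
    ("image_thumb", "image link"),
]


def char_ngrams(name, n_min=2, n_max=3):
    """Lowercase, unify separators, pad with spaces, return character n-grams.

    Hypothetical helper: the 2..3 range is an assumption, not a tuned value.
    """
    text = " " + name.lower().replace("_", " ").replace("-", " ").strip() + " "
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, terms, labels):
        self.gram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.term_counts = Counter()             # label -> number of terms
        self.vocab = set()
        for term, label in zip(terms, labels):
            grams = char_ngrams(term)
            self.gram_counts[label].update(grams)
            self.term_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, term):
        n_terms = sum(self.term_counts.values())
        grams = char_ngrams(term)
        best_label, best_score = None, -math.inf
        for label in sorted(self.gram_counts):
            total = sum(self.gram_counts[label].values())
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(self.term_counts[label] / n_terms)
            for gram in grams:
                count = self.gram_counts[label][gram]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


LABELS = [label for label, _ in TRAINING]
TERMS = [term for _, term in TRAINING]
model = NgramNaiveBayes().fit(TERMS, LABELS)

for wild_name in ["product_id", "Product Link"]:
    print(wild_name, "->", model.predict(wild_name))
```

Character n-grams are used instead of whole words because column names are short and vary mostly in casing, separators, and abbreviations. Note that the toy data lists 'Image' under two categories, so some ambiguity is built in. With the large set of variations mentioned in the question, the same structure should generalize better than this 12-row example.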
Upvotes: 1
Reputation: 417
I think you could use the Levenshtein distance, which measures the difference (edit distance) between words and phrases. There are many implementations in Python and R. You could assign each unknown column name to the known key it is closest to, or apply some similar rule.
You may also check here
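As a minimal sketch of the nearest-key rule described in this answer, the following computes the Levenshtein distance with the standard dynamic-programming recurrence and picks the known key with the smallest distance. The helper names `normalize` and `closest_key` are hypothetical, and lowercasing plus treating `_` and `-` as spaces is an assumption about what should count as "the same" name.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalize(name):
    """Hypothetical helper: lowercase and treat '_' and '-' as spaces."""
    return name.lower().replace("_", " ").replace("-", " ").strip()


def closest_key(unknown, known_keys):
    """Hypothetical helper: return the known key nearest to `unknown`."""
    target = normalize(unknown)
    return min(known_keys, key=lambda k: levenshtein(target, normalize(k)))


known = ["id", "link", "image_link"]
print(closest_key("Image Link", known))
```

Plain edit distance handles casing and separator differences well once the names are normalized, but it can struggle with abbreviations such as 'img_url', so a distance threshold or a fallback rule may be needed for those cases.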
Upvotes: 1