Reputation: 563
I have a database of thousands of different colours. I want to map them to one of the colours I have in a list.
Previously this database held only a few hundred colours, and I managed with something like the code below. That is becoming unmaintainable: the database of unclassified colours keeps growing, and mapping it takes me a lot of time every week.
How can I improve this or what would be a better approach?
mapped_colours = ['Red', 'Green', 'Yellow', 'Blue', 'White', 'Black', 'Pink', 'Purple']  # ...and more
colour_map_dict = {
    'olive': 'Green',
    'khaki': 'Green',
}

def classify_colour(colour):
    # First look for one of the base colour names inside the input
    for mp in mapped_colours:
        if mp.lower() in colour.lower():
            return mp
    # Then fall back to known aliases; the loop variables must not shadow `colour`
    for alias, mapped in colour_map_dict.items():
        if alias in colour.lower():
            return mapped
Here is an example of the data coming in.
Resin Dark Wash Indi
Filtered Canyon
999 Black
Winter White/Dove Grey
Midnight/min
White & black
Green/White
Red/White
Multicolor
royal blue
Black Plum Grey
Rose/ Gold
Red And White
Offwht/Gg
Black Gunmetal
Berry/Black
Caramel
Blue Stone Bleached
All Tan
Pale Blush
Tee
White / Multi
00-black
Flat Foundation
Baby Blue
Beige Melange
Upvotes: 3
Views: 1059
Reputation: 19695
Once you have a large database mapping names to correct answers (see Martijn's answer), you could use that database to train a classification algorithm, for example one from scikit-learn:
#!/usr/bin/env python3
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
mapped_colours = ['Red', 'Green', 'Yellow', 'Blue', 'White', 'Black', 'Pink', 'Purple']
colour_map = [
    ('olive', 'Green'),
    ('khaki', 'Green'),
    ('snow white', 'White'),
    ('alice white', 'White'),
    ('pale blush', 'Pink'),
    ('baby blue', 'Blue'),
    ('midnight', 'Blue'),
    # ...and so on and so on - you'll need a lot of these
]
# A classifier classifies inputs into categories (colors in this case)
clf = svm.SVC(gamma=0.001, C=100.)
# A vectorizer turns strings into arrays which can be used as input
vectorizer = CountVectorizer()
# Train both the classifier and the vectorizer. This can take some time.
training = vectorizer.fit_transform([k for (k, v) in colour_map])
clf.fit(training, [mapped_colours.index(v) for (k, v) in colour_map])
# Predict some colors!
while True:
    query = input('Enter a color: ')
    guess = clf.predict(vectorizer.transform([query]))[0]
    print('Maybe', mapped_colours[guess])
Example run:
Enter a color: snow
Maybe White
Enter a color: dark khaki
Maybe Green
Enter a color: baby bedroom
Maybe Blue
Alternatively, you could have your model try to predict an RGB color, if your input data is already in RGB form, and work from there.
Because of the very short input, the classifier will likely not get very smart, but if the database is large enough it could perhaps make the job of adding colors a bit easier: if the classifier guesses correctly, just add its guess as a color. If not, you will still need to manually classify it, but the classifier will pick up the correct answer in future runs.
Disclaimer: I'm not sure if SVC is a right fit (heh) for your problem, but it might be Good Enough and worth a try.
Upvotes: 1
Reputation: 1124548
I'd start with a decent colour dictionary to map names to colour definitions in a given colour space (like RGB or CMYK or HSV). There are various sets available on the internet; you'll have to do work up-front to obtain them and normalise the data from each to use the same colour space. The more sources you can obtain, the richer your mapping; you appear to have a load of fashion colours (paint? cloth?) in your input set, and (commercial) fashion is forever trying to differentiate by inventing new colour names.
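As a rough sketch of the normalisation step, suppose one source lists hex strings and another lists HSV values with hue in degrees and saturation/value in percent. Both formats, the helper names and the two sample entries are assumptions for illustration, not taken from any particular dataset:

```python
import colorsys

def hex_to_rgb(value):
    """Convert '#RRGGBB' or 'RRGGBB' to an (r, g, b) tuple of 0-255 ints."""
    value = value.lstrip('#')
    if len(value) != 6:
        raise ValueError(f'expected 6 hex digits, got {value!r}')
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))

def hsv_to_rgb(h_deg, s_pct, v_pct):
    """Convert HSV (degrees, percent, percent) to an (r, g, b) tuple of 0-255 ints."""
    r, g, b = colorsys.hsv_to_rgb(h_deg / 360.0, s_pct / 100.0, v_pct / 100.0)
    return tuple(round(c * 255) for c in (r, g, b))

# Hypothetical entries from two differently formatted sources,
# both ending up in the same RGB representation
names_to_rgb = {
    'olive': hex_to_rgb('#808000'),
    'khaki': hsv_to_rgb(54, 41.7, 94.1),
}
print(names_to_rgb)
```

Whichever colour space you settle on, converting at import time keeps everything downstream working with a single representation.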
Because a colour space is finite, you can then algorithmically partition that space into a limited set of groups. Each colour name will then automatically map to a given group.
Looking around a bit, a good starting point would be the Wikipedia lists of colour names. The compact list should be easily machine parseable, even in the basic HTML form, or you can use the MediaWiki API to get a raw format that's even easier to parse. Then perhaps add other standardised colour name dictionaries; the goal here is to get as many names as possible all mapping to the same colour space.
I'd store these names in a database table, and have a simple mathematical formula ready to divide the colour space into your basic groups. That way any colour in the table can be mapped to (say) RGB, and RGB to simple name.
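One possible shape for such a formula, as a minimal sketch: convert RGB to HSV and cut the hue circle into bands, with separate rules for very dark and very washed-out colours. The function name, the thresholds and the hue boundaries below are all assumptions to be tuned against your own data; the eight group names are the ones from the question.

```python
import colorsys

def basic_group(rgb):
    """Map an (r, g, b) tuple of 0-255 ints to one of eight basic colour names."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    hue = h * 360.0

    # Very dark colours count as black regardless of hue
    if v < 0.2:
        return 'Black'
    # Washed-out colours have no meaningful hue: split them by brightness
    if s < 0.15:
        return 'White' if v >= 0.5 else 'Black'
    # Reddish hues wrap around 0 degrees; pale ones are treated as pink
    if hue < 30 or hue >= 330:
        return 'Pink' if s < 0.5 else 'Red'
    if hue < 70:
        return 'Yellow'
    if hue < 170:
        return 'Green'
    if hue < 260:
        return 'Blue'
    return 'Purple'

for name, rgb in [('royal blue', (65, 105, 225)), ('crimson', (220, 20, 60)),
                  ('snow', (255, 250, 250)), ('forest green', (34, 139, 34))]:
    print(name, '->', basic_group(rgb))
```

Hue bands will not always agree with naming conventions: with these boundaries olive, (128, 128, 0), lands in Yellow rather than Green, so a small table of explicit overrides on top of the formula is probably still needed.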
Next, build a simple spell-checker trained on your database of names, and run your input through that first. You have some pretty hard-to-work-with data there, but a trained colour name spell checker can probably clean up Offwht/Gg to something that can be matched. Also use natural-text search to find partial matches.
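A trained spell-checker is more work than fits here, but as a stand-in the standard library's difflib fuzzy matching gives a feel for the clean-up step. This is only a sketch: `known_names` is a hypothetical sample of what the names table might contain, and the helper names and the 0.75 cutoff are assumptions.

```python
import difflib
import re

# Hypothetical sample of known colour names; in practice load these from the names table
known_names = ['off white', 'black', 'gunmetal', 'royal blue',
               'baby blue', 'caramel', 'multicolour']

def clean_tokens(raw):
    """Lower-case the input and split it on anything that is not a letter."""
    return [token for token in re.split(r'[^a-z]+', raw.lower()) if token]

def suggest(raw, cutoff=0.75):
    """Map each token of a raw input to its closest known name, if any is close enough."""
    suggestions = {}
    for token in clean_tokens(raw):
        matches = difflib.get_close_matches(token, known_names, n=1, cutoff=cutoff)
        if matches:
            suggestions[token] = matches[0]
    return suggestions

for raw in ['Offwht/Gg', 'Multicolor', '999 Black']:
    print(raw, '->', suggest(raw))
```

Tokens with no close match (such as the `gg` in Offwht/Gg) are simply left out, which makes them easy to queue for manual review.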
Note that if the colour names you receive come with image data, you could find the most prevalent colour in each image, which gives you another name (from your input data) -> colour space mapping to use.
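A minimal sketch of finding the most prevalent colour, assuming the image has already been decoded into a sequence of (r, g, b) tuples (the decoding itself, for example with an imaging library, is not shown). The function name and the bucket width `step` are assumptions:

```python
from collections import Counter

def dominant_colour(pixels, step=32):
    """Return the centre of the most common colour bucket among the pixels.

    Each channel is quantised into buckets of width `step`, so that
    near-identical shades are counted together.
    """
    buckets = Counter(
        tuple(channel // step for channel in pixel) for pixel in pixels
    )
    if not buckets:
        raise ValueError('no pixels given')
    bucket, _count = buckets.most_common(1)[0]
    # Report the midpoint of the winning bucket as a representative colour
    return tuple(min(255, index * step + step // 2) for index in bucket)

# Tiny hand-made "image": mostly dark blue, with a little white
pixels = [(10, 20, 120)] * 5 + [(12, 22, 125)] * 4 + [(250, 250, 250)] * 3
print(dominant_colour(pixels))
```

Product photos often have a large plain background, so in practice you would probably want to mask or crop that out before counting.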
Upvotes: 3