rom

Reputation: 666

What is the best way to get accurate text similarity in python for comparing single words or bigrams?

I've got similar product data in both the products_a array and products_b array:

products_a = [{'color': "White", 'size': "2' 3\""}, {'color': "Blue", 'size': "5' 8\""}]
products_b = [{'color': "Black", 'size': "2' 3\""}, {'color': "Sky blue", 'size': "5' 8\""}]

I would like to be able to accurately tell the similarity between the colors in the two arrays, with a score between 0 and 1. For example, comparing "Blue" against "Sky blue" should score fairly high (probably something like 0.78).

Spacy Similarity

I tried using spacy to solve this:

import spacy
nlp = spacy.load('en_core_web_sm')

def similarityscore(text1, text2 ):
    doc1 = nlp( text1 )
    doc2 = nlp( text2 )
    similarity = doc1.similarity( doc2 )
    return similarity

Yeah, well, when passing in "Blue" against "Sky blue" it scores 0.6545742918773636. OK, but what happens when passing in "White" against "Black"? The score is 0.8176945362451089... as in, spacy is saying "White" and "Black" are ~81% similar! That's a failure when I'm trying to make sure product colors are not similar.

Jaccard Similarity

I tried Jaccard similarity on "White" against "Black" using the code below and got a score of 0.0 (maybe overkill for single words, but it leaves room for larger corpora later):

# remove punctuation and lowercase all words function
def simplify_text(text):
    for punctuation in ['.', ',', '!', '?', '"']:
        text = text.replace(punctuation, '')
    return text.lower()

# Jaccard function
def jaccardSimilarity(text_a, text_b):
    word_set_a, word_set_b = [set(simplify_text(text).split())
                              for text in [text_a, text_b]]
    num_shared = len(word_set_a & word_set_b)
    num_total = len(word_set_a | word_set_b)
    jaccard = num_shared / num_total
    return jaccard
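
Here's how it behaves on the color pairs above (a quick check):

print(jaccardSimilarity("White", "Black"))     # 0.0 -- no shared words
print(jaccardSimilarity("Blue", "Sky blue"))   # 0.5 -- one shared word out of two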

Getting scores as different as 0.0 and 0.8176945362451089 for "White" against "Black" is not acceptable to me, and even taking the mean of the two would not be accurate. I'm still looking for a more accurate way to solve this. Please let me know if you have any better approaches.

Upvotes: 7

Views: 4201

Answers (5)

Kim

Reputation: 1686

NLP packages may be better suited to longer text fragments and more sophisticated text analysis.

As you've discovered with 'black' and 'white', they make assumptions about similarity that are not right in the context of a simple list of products.

Instead you can see this not as an NLP problem, but as a data transformation problem. This is how I would tackle it.

To get the unique list of colors across both lists, use set operations on the colors found in the two product lists. Set comprehensions get a unique set of colors from each product list, then a union() of the two sets gives the unique colors from both lists, with no duplicates. (Not really needed for 4 products, but very useful for 400, or 4000.)

products_a = [{'color': "White", 'size': "2' 3\""}, {'color': "Blue", 'size': "5' 8\""} ]
products_b = [{'color': "Black", 'size': "2' 3\""}, {'color': "Sky blue", 'size': "5' 8\""} ]

products_a_colors = {product['color'].lower() for product in products_a}
products_b_colors = {product['color'].lower() for product in products_b}
unique_colors = products_a_colors.union(products_b_colors)
print(unique_colors)

The colors are lowercased because in Python 'Blue' != 'blue' and both spellings are found in your product lists.

The above code finds these unique colors:

{'black', 'white', 'sky blue', 'blue'}

The next step is to build an empty color map.

colormap = {color: '' for color in unique_colors}
import pprint
pp = pprint.PrettyPrinter(indent=4, width=10, sort_dicts=True)
pp.pprint(colormap)

Result:

{   'black': '',
    'blue': '',
    'sky blue': '',
    'white': ''}

Paste the empty map into your code and fill out mappings for your complex colors like 'Sky blue'. Delete simple colors like 'white', 'black' and 'blue'. You'll see why below.

Here's an example, assuming a slightly bigger range of products with more complex or unusual colors:

colormap = {
    'sky blue': 'blue',
    'dark blue': 'blue',
    'bright red': 'red',
    'dark red': 'red',
    'burgundy': 'red'
}

This function helps you group together colors that are similar, based on your color map. color() maps complex colors onto base colors and lowercases everything so that 'Blue' is treated the same as 'blue'. (NOTE: the colormap dictionary should only use lowercase keys.)

def color(product_color):
    return colormap.get(product_color.lower(), product_color).lower()

Examples:

>>> color('Burgundy')
'red'
>>> color('Sky blue')
'blue'
>>> color('Blue')
'blue'

If a color doesn't have a key in the colormap, it passes through unchanged, except that it is converted to lowercase:

>>> color('Red')
'red'
>>> color('Turquoise')
'turquoise'

This is the scoring part. The product() function from the standard library's itertools module is used to pair items from products_a with items from products_b. Each pair is numbered using enumerate() because, as will become clear later, a score for a pair is of the form (pair_id, score). This way each pair can have more than one score.

'cartesian product' is just a mathematical name for what itertools.product() does. I've renamed it to avoid confusion with product_a and product_b. itertools.product() returns all possible pairs between two lists.

from itertools import product as cartesian_product
product_pairs = {
    pair_id: product_pair for pair_id, product_pair
    in enumerate(cartesian_product(products_a, products_b))
}
print(product_pairs)

Result:

{0: ({'color': 'White', 'size': '2\' 3"'}, {'color': 'Black', 'size': '2\' 3"'}),
 1: ({'color': 'White', 'size': '2\' 3"'}, {'color': 'Sky blue', 'size': '5\' 8"'}),
 2: ({'color': 'Blue', 'size': '5\' 8"'}, {'color': 'Black', 'size': '2\' 3"'}),
 3: ({'color': 'Blue', 'size': '5\' 8"'}, {'color': 'Sky blue', 'size': '5\' 8"'})
}

The list will be much longer if you have 100s of products.

Then here's how you might compile color scores:

color_scores = [(pair_id, 0.8) for pair_id, (product_a, product_b)
                in product_pairs.items()
                if color(product_a['color']) == color(product_b['color'])]
print(color_scores)

In the example data, one product pair matches via the color() function: pair number 3, with the 'Blue' product from products_a and the 'Sky blue' item from products_b. As the color() function evaluates both 'Sky blue' and 'Blue' to the value 'blue', this pair is awarded a score, 0.8:

[(3, 0.8)]

"deep unpacking" is used to extract product details and the "pair id" of the current product pair, and put them in local variables for processing or display. There's a nice tutorial article about "deep unpacking" here.

That color rule is a blueprint for other rules. For example, you could write a rule based on size and give it a different score, say 0.5:

size_scores = [(pair_id, 0.5) for pair_id, (product_a, product_b)
               in product_pairs.items()
               if product_a['size'] == product_b['size']]
print(size_scores)

and here are the resulting scores based on the 'size' attribute.

[(0, 0.5), (3, 0.5)]

This means pair 0 scores 0.5 and pair 3 scores 0.5 because their sizes match exactly.

To get the total score for a product pair you might average the color and size scores:

import itertools

print()
print("Totals")
score_sources = [color_scores, size_scores]  # add more score lists to this list
all_scores = sorted(itertools.chain(*score_sources))
pair_scores = itertools.groupby(all_scores, lambda x: x[0])
for pair_id, pairs in pair_scores:
    scores = [score for _, score in pairs]
    average = sum(scores) / len(scores)
    print(f"Pair {pair_id}: score {average}")
    for n, product in enumerate(product_pairs[pair_id]):
        print(f"  --> Item {n+1}: {product}")

Results:

Totals
Pair 0: score 0.5
  --> Item 1: {'color': 'White', 'size': '2\' 3"'}
  --> Item 2: {'color': 'Black', 'size': '2\' 3"'}
Pair 3: score 0.65
  --> Item 1: {'color': 'Blue', 'size': '5\' 8"'}
  --> Item 2: {'color': 'Sky blue', 'size': '5\' 8"'}

Pair 3, which matches colors and sizes, has the highest score and pair 0, which matches on size only, scores lower. The other two pairs have no score.

Upvotes: 2

CasseroleBoi

Reputation: 106

You want to first convert the color name to hex, and then compare the two hex values. Do not compare strings!

import math
from difflib import SequenceMatcher
from matplotlib import colors
COLOR_NAMES = list(colors.CSS4_COLORS.keys()) #choose any color module you want and get a list of all colors

def hexFromColorName(name):
    name = name.lower() #matplotlib names are lowercase
    closest_match = [0, ""]
    for colorname in COLOR_NAMES:
        sim = SequenceMatcher(None, name, colorname).ratio()
        #print("Similarity between two strings is: " + str(sim) )
        if sim > closest_match[0]:
            closest_match = sim, colorname

    #use matplotlib's color conversion dictionary to get hex values
    return colors.CSS4_COLORS[closest_match[1]]
    

def compareRGB(color1, color2):
    color1, color2 = color1[1:], color2[1:] #trim the # from hex color codes

    #convert from hex string to decimal tuple
    color1 = (int(color1[:2], base=16), int(color1[2:4], base=16), int(color1[4:], base=16))  
    color2 = (int(color2[:2], base=16), int(color2[2:4], base=16), int(color2[4:], base=16))

    #standard euclidean distance between two points in space
    dist =  math.sqrt(
                        math.pow((color1[0]-color2[0]), 2) +
                        math.pow((color1[1]-color2[1]), 2) +
                        math.pow((color1[2]-color2[2]), 2) 
                     )/255/math.sqrt(3)      
    if dist > 1: dist = 1
    return 1 - dist

Examples:

>>> compareRGB(hexFromColorName('dark green'), hexFromColorName('green'))
0.9366046763242764
>>> compareRGB(hexFromColorName('Light Blue'),hexFromColorName('Black'))
0.18527897735531407

Upvotes: 2

Kum_R

Reputation: 368

Gensim has a Python implementation of Word2Vec, which provides word similarity:

from gensim.models import Word2Vec
model = Word2Vec.load('path/to/your/model')
model.wv.similarity('Chennai', 'London')
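
If you don't have a trained model on disk, one option (my addition, not part of the original answer) is to pull a small pretrained embedding through gensim's downloader API:

import gensim.downloader as api

# downloads on first use; 'glove-wiki-gigaword-50' is a small pretrained
# embedding chosen here purely for illustration
word_vectors = api.load('glove-wiki-gigaword-50')

print(word_vectors.similarity('blue', 'black'))
print(word_vectors.similarity('blue', 'white'))

Note that generic word embeddings share the weakness described in the question: opposites like 'black' and 'white' can still come out looking similar, and a multi-word color like 'sky blue' won't be in the vocabulary as a single token.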

Upvotes: 1

tezenko

Reputation: 31

I found some methods that might be helpful. I am new to programming, so I don't really know how to apply them to your data set, but I still wanted to share them.

from difflib import SequenceMatcher
#https://towardsdatascience.com/sequencematcher-in-python-6b1e6f3915fc

s1 = "blue"
s2 = "sky blue"
sim = SequenceMatcher(None, s1, s2).ratio()
print("Similarity between two strings is: " + str(sim) )

This code says Similarity between two strings is: 0.6666666666666666. I tried the same code for "black" and "white", and it says the similarity between the two strings is 0.0.

Note: I think scikit-learn's Affinity Propagation and the Levenshtein distance might also be helpful, but I don't know how to apply them to your question.
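
Here is a rough sketch of how those two ideas might be combined (my own guess at an implementation; SequenceMatcher stands in for Levenshtein distance and the color list is invented):

import numpy as np
from difflib import SequenceMatcher
from sklearn.cluster import AffinityPropagation

color_names = ["white", "black", "blue", "sky blue", "dark blue", "red"]

# pairwise string-similarity matrix, values between 0 and 1
similarity = np.array([[SequenceMatcher(None, a, b).ratio() for b in color_names]
                       for a in color_names])

# cluster directly on the precomputed similarity matrix
clustering = AffinityPropagation(affinity="precomputed", random_state=0).fit(similarity)
for name, label in zip(color_names, clustering.labels_):
    print(label, name)  # names sharing a label were judged similar strings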

Upvotes: 2

polm23

Reputation: 15593

If your actual goal here is to handle colors on product descriptions you should treat this as a classification problem, though note that for short text this is going to be very hard. Luckily most items should use common colors so it shouldn't be hard to get good coverage. I suspect picking 12 or so colors and classifying into them would be easier than making good color name embeddings.

I would not use string distance metrics like Jaccard distance. They just tell you how many of the letters or word-chunks are the same between two strings; they don't capture meaning.

As mentioned in the comments, normal word vectors won't find opposites for you, and there is plenty written about why that's hard. The advice to work with color name word embeddings is very good, and is the best way to get a similarity score.
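
As a minimal sketch of the "pick a dozen colors and classify into them" idea (the palette and the substring rule are my assumptions, not a method from this answer):

PALETTE = ["white", "black", "grey", "red", "orange", "yellow",
           "green", "blue", "purple", "pink", "brown", "beige"]

def classify(color_name, default="other"):
    """Return the first palette color mentioned in the product color string."""
    text = color_name.lower()
    for base in PALETTE:
        if base in text:
            return base
    return default

print(classify("Sky blue"))   # 'blue'
print(classify("White"))      # 'white'
print(classify("Burgundy"))   # 'other' -- unusual shades need an explicit mapping

Once every product color is mapped to a class, the pair score reduces to checking whether the two classes match, much like the colormap approach in the first answer.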

Upvotes: 1
