user3792491

Cosine similarity for the LaTeX code of an equation

I have extended this SO question and am comparing two LaTeX equations. Here is an example using two versions of the quadratic formula.

eqn1 = r"*=\frac{-*\pm\sqrt{*^2-4ac}}{2a}"
eqn2 = r"x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}"

I need these two to compare as equal, because I have used * as a placeholder for x and b. All I am doing is converting each equation to a word list.

eqn1_word = ['*', 'frac', '*', 'pm', 'sqrt', '*', '2', '4ac', '2a']
eqn2_word = ['x', 'frac', 'b', 'pm', 'sqrt', 'b', '2', '4ac', '2a']

So the vectors are:

eqn1_vec = Counter({'*': 3, 'frac': 1, 'sqrt': 1, '2': 1, '2a': 1, '4ac': 1, 'pm': 1})
eqn2_vec = Counter({'b': 2, 'frac': 1, 'sqrt': 1, '2': 1, '2a': 1, '4ac': 1, 'x': 1, 'pm': 1})
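
The tokenizing and counting steps above can be sketched roughly as follows. The `get_word` shown here is only a hypothetical stand-in (the real one is not shown in this post); it assumes a word is any run of letters or digits and that each * is its own token. It is written for Python 3 and uses raw strings so that `\f` in `\frac` is not read as an escape sequence.

```python
import re
from collections import Counter


def get_word(equation):
    """Hypothetical tokenizer: runs of letters/digits, plus each '*' wildcard."""
    return re.findall(r"[A-Za-z0-9]+|\*", equation)


eqn1 = r"*=\frac{-*\pm\sqrt{*^2-4ac}}{2a}"
eqn2 = r"x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}"

# Word lists, in the order the tokens appear in each equation
eqn1_word = get_word(eqn1)
eqn2_word = get_word(eqn2)

# Count vectors: token -> number of occurrences
eqn1_vec = Counter(eqn1_word)
eqn2_vec = Counter(eqn2_word)

print(eqn1_word)
print(eqn2_word)
print(eqn1_vec["*"], eqn2_vec["b"], eqn2_vec["x"])
```

With this assumed tokenizer, the two word lists and the counts for *, b and x match the ones listed above.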

My extension is this: I compute the fraction of * tokens in eqn1_word, then compute the ordinary cosine similarity as given in that answer. Finally, I add the two values, and the sum should be close to 1.

This works fine in most scenarios (when only one variable is replaced by *). Here, however, * has a count of 3 in eqn1_vec, while in eqn2_vec b = 2 and x = 1.

For a fuller description, please check this. Based on that reference, my code looks like this:

import math
import traceback
from collections import Counter

def get_cosine(self, c_eqn1_eqn, c_eqn2_eqn):
    print('c_eqn1_eqn = %s' % c_eqn1_eqn)
    print('c_eqn2_eqn = %s' % c_eqn2_eqn)
    # Number of "*" placeholders in the first equation string
    _special_symbol = float(c_eqn1_eqn.count("*"))
    cos_result = 0
    symbol_percentage = 0
    try:
        # get_word returns the word list for an equation
        eqn1_vector = Counter(self.get_word(c_eqn1_eqn))
        eqn2_vector = Counter(self.get_word(c_eqn2_eqn))
        _words = sum(eqn1_vector.values())
        # A "*" that also appears in the second equation is not a placeholder
        if "*" in eqn2_vector:
            _special_symbol -= eqn2_vector["*"]
        print('_special_symbol = %s' % _special_symbol)
        print('_words @ last = %s' % _words)
        try:
            symbol_percentage = _special_symbol / _words
        except ZeroDivisionError:
            symbol_percentage = 0.0
    except Exception as exp:
        print("Exception at converting equation to vector: %s" % exp)
        traceback.print_exc()
    else:
        # Ordinary cosine similarity over the two count vectors
        intersection = set(eqn1_vector.keys()) & set(eqn2_vector.keys())
        numerator = sum([eqn1_vector[x] * eqn2_vector[x] for x in intersection])
        _sum1 = sum([eqn1_vector[x]**2 for x in eqn1_vector.keys()])
        _sum2 = sum([eqn2_vector[x]**2 for x in eqn2_vector.keys()])
        denominator = math.sqrt(_sum1) * math.sqrt(_sum2)
        print('numerator = %s' % numerator)
        print('denominator = %s' % denominator)
        if not denominator:
            cos_result = 0
        else:
            cos_result = float(numerator) / denominator
        print(cos_result)
    # Add the placeholder fraction to the cosine score and cap the total at 1
    final_result = float(symbol_percentage) + cos_result
    return final_result if final_result <= 1.0 else 1

The problem is that the numerator becomes too small, because the intersection of the two vectors is small. I copied this from my class, so please ignore self.
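
To make the problem concrete, here is a small Python 3 sketch that repeats the same arithmetic on the two example vectors from this question, without the class or `get_word`. The variable names are only illustrative.

```python
import math
from collections import Counter

eqn1_vec = Counter({'*': 3, 'frac': 1, 'sqrt': 1, '2': 1, '2a': 1, '4ac': 1, 'pm': 1})
eqn2_vec = Counter({'b': 2, 'frac': 1, 'sqrt': 1, '2': 1, '2a': 1, '4ac': 1, 'x': 1, 'pm': 1})

# Only the six tokens present in both vectors contribute to the dot product
shared = set(eqn1_vec) & set(eqn2_vec)
numerator = sum(eqn1_vec[t] * eqn2_vec[t] for t in shared)

# Vector norms: '*' (count 3) and 'b' (count 2) enlarge the denominator only
norm1 = math.sqrt(sum(c * c for c in eqn1_vec.values()))
norm2 = math.sqrt(sum(c * c for c in eqn2_vec.values()))
cos_result = numerator / (norm1 * norm2)

# Fraction of '*' tokens among all tokens of the first equation
symbol_percentage = eqn1_vec['*'] / sum(eqn1_vec.values())

final_result = min(1.0, symbol_percentage + cos_result)
print(numerator, round(cos_result, 3), round(symbol_percentage, 3), round(final_result, 3))
```

The numerator is 6 and the denominator is sqrt(15) * sqrt(11), so the cosine part is about 0.47. Adding the * fraction of 3/9 gives about 0.80 rather than a value close to 1.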

How can I solve this? Thanks in advance. If there is any mistake in the question, or my concept is wrong, please let me know.

Upvotes: 1

Views: 4838

Answers (1)

user3792491

I found a solution to this problem.

Since we cannot (and should not) increase the numerator, I decided to adjust the denominator instead. My logic is to reduce the denominator when the number of * tokens in eqn1 equals the number of non-intersecting tokens in eqn2. Otherwise the denominator is left as it is. With this change, I no longer have to calculate the percentage of "*" or add it to the cosine result.

import math
import traceback
from collections import Counter

def get_cosine(c_eqn1, c_eqn2):
    cos_result = 0
    try:
        # get_word returns the word list for an equation
        eqn1_vector = Counter(get_word(c_eqn1))
        eqn2_vector = Counter(get_word(c_eqn2))
        _special_symbol = 0
        spe_list = list()
        # Count the * occurrences and remember the tokens that contain *
        for _val in eqn1_vector.keys():
            if "*" in _val:
                _special_symbol += eqn1_vector[_val]
                spe_list.append(_val)
        # A "*" that also appears in eqn2 is not a placeholder
        if "*" in eqn2_vector:
            _special_symbol -= eqn2_vector["*"]
    except Exception as exp:
        print("Exception at converting equation to vector: %s" % exp)
        traceback.print_exc()
    else:
        intersection = set(eqn1_vector.keys()) & set(eqn2_vector.keys())
        numerator = sum([eqn1_vector[x] * eqn2_vector[x]
                         for x in intersection])
        non_intersection_sum = 0
        non_intersection_value = list()
        # Count the eqn2 tokens that have no match in eqn1
        for _val in eqn2_vector.keys():
            if _val not in intersection:
                non_intersection_sum += eqn2_vector[_val]
                non_intersection_value.append(_val)
        # Join both non-intersecting lists
        if non_intersection_value:
            non_intersection_value.extend(spe_list)
        # If the two non-intersecting counts differ, empty the list
        # so that the plain cosine formula is used
        if _special_symbol != non_intersection_sum:
            non_intersection_value = list()
        # Cosine similarity formula, skipping the tokens collected above
        _sum1 = sum([eqn1_vector[x]**2 for x in eqn1_vector.keys() if x not in non_intersection_value])
        _sum2 = sum([eqn2_vector[x]**2 for x in eqn2_vector.keys() if x not in non_intersection_value])
        denominator = math.sqrt(_sum1) * math.sqrt(_sum2)
        if not denominator:
            cos_result = 0
        else:
            cos_result = float(numerator) / denominator
    return cos_result if cos_result <= 1.0 else 1
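
As a quick check of the idea, here is a condensed Python 3 sketch of the same denominator adjustment. It is a restatement under assumptions, not the exact code above: the function name `wildcard_cosine` is made up for illustration, it takes ready-made word lists instead of calling `get_word`, and it computes the denominator as the square root of the product of the two sums, which is mathematically the same quantity.

```python
import math
from collections import Counter


def wildcard_cosine(tokens1, tokens2):
    """Cosine similarity that drops '*' and unmatched tokens when their counts agree."""
    v1, v2 = Counter(tokens1), Counter(tokens2)
    shared = set(v1) & set(v2)
    numerator = sum(v1[t] * v2[t] for t in shared)

    # Tokens of the first equation that contain the wildcard, and their total count
    wild = [t for t in v1 if "*" in t]
    wild_count = sum(v1[t] for t in wild) - v2.get("*", 0)

    # Tokens of the second equation with no counterpart in the first
    unmatched = [t for t in v2 if t not in shared]
    unmatched_count = sum(v2[t] for t in unmatched)

    # Skip both groups only if every wildcard can stand for one unmatched token
    skip = set()
    if unmatched and wild_count == unmatched_count:
        skip = set(unmatched) | set(wild)

    sum1 = sum(c * c for t, c in v1.items() if t not in skip)
    sum2 = sum(c * c for t, c in v2.items() if t not in skip)
    denominator = math.sqrt(sum1 * sum2)
    if not denominator:
        return 0.0
    return min(1.0, numerator / denominator)


eqn1_word = ['*', 'frac', '*', 'pm', 'sqrt', '*', '2', '4ac', '2a']
eqn2_word = ['x', 'frac', 'b', 'pm', 'sqrt', 'b', '2', '4ac', '2a']
# Hypothetical mismatch: four unmatched tokens against three wildcards
eqn3_word = ['x', 'frac', 'b', 'pm', 'sqrt', 'c', 'd', '2', '4ac', '2a']

matched_score = wildcard_cosine(eqn1_word, eqn2_word)
mismatch_score = wildcard_cosine(eqn1_word, eqn3_word)
print(matched_score, round(mismatch_score, 3))
```

For the quadratic example the three * tokens line up with b, b and x, so both sums shrink to 6 and the score becomes 1. In the made-up mismatch case the counts differ, so the function falls back to the plain cosine value, which stays well below 1.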

Upvotes: 1
