Lazhar
Lazhar

Reputation: 1458

Function to calculate a weighed score from MySQL data?

I've got two tables (topics and terms) with a third table for my many-to-many relations between the two entities.

Each relation, called a bagging, has a source (text) and a weight (int between 0 and 100). A same pair (topic-term) can have several bagging (difference sources) and each has a different weight.

Now, when I query a topic to find out what are its best terms (more weight), I ideally want unique values with a calculated weight:

Here is the database schema:

| TOPIC
+-------+------------------+------+-----+---------+----------------+
| Field | Type             | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| id    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| label | varchar(255)     | NO   | UNI | NULL    |                |
| wtext | varchar(40)      | YES  |     | NULL    |                |
+-------+------------------+------+-----+---------+----------------+

| TERM
+-------+---------------------+------+-----+---------+----------------+
| Field | Type                | Null | Key | Default | Extra          |
+-------+---------------------+------+-----+---------+----------------+
| id    | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| label | varchar(255)        | NO   | UNI | NULL    |                |
| slug  | varchar(255)        | NO   |     | NULL    |                |
+-------+---------------------+------+-----+---------+----------------+

| BAGGING
+----------+---------------------+------+-----+---------+----------------+
| Field    | Type                | Null | Key | Default | Extra          |
+----------+---------------------+------+-----+---------+----------------+
| id       | int(10) unsigned    | NO   | PRI | NULL    | auto_increment |
| topic_id | int(11) unsigned    | NO   | MUL | NULL    |                |
| term_id  | bigint(11) unsigned | NO   | MUL | NULL    |                |
| weight   | tinyint(1) unsigned | NO   |     | NULL    |                |
| source   | varchar(8)          | YES  |     | GEN     |                |
+----------+---------------------+------+-----+---------+----------------+

And this is my simple query:

SELECT 
    bagging.topic_id as topic_id,
    topic.label as topic_label,
    bagging.term_id as term_id,
    term.label as term_label,
    bagging.weight as weight,
    bagging.source as source
FROM
    bagging
JOIN term   ON term.id  = bagging.term_id
JOIN topic  ON topic.id = bagging.topic_id
WHERE
    bagging.topic_id = ( SELECT id FROM topic WHERE label = 'Altruism' )
ORDER BY
    bagging.weight DESC

Which gives me the below results:

+----------+-------------+---------+-----------------------+--------+--------+
| topic_id | topic_label | term_id | term_label            | weight | source |
+----------+-------------+---------+-----------------------+--------+--------+
|        8 | Altruism    |      83 | Altruism              |    100 | TOPIC  |
+----------+-------------+---------+-----------------------+--------+--------+
|        8 | Altruism    |     100 | Altruism (philosophy) |     95 | WPRD   |
|        8 | Altruism    |     100 | Altruism (philosophy) |     95 | MAN    |
|        8 | Altruism    |      84 | Truist                |     95 | MAN    |
|        8 | Altruism    |      84 | Truist                |     15 | WPRD   |
+----------+-------------+---------+-----------------------+--------+--------+
|        8 | Altruism    |      94 | Selfless action       |     95 | WPRD   |
|        8 | Altruism    |      95 | Alturism              |     95 | WPRD   |
|        8 | Altruism    |      96 | Digital altruism      |     95 | WPRD   |
|        8 | Altruism    |      97 | Selflessly            |     95 | WPRD   |
|        8 | Altruism    |      98 | Altruistical          |     95 | WPRD   |
|        8 | Altruism    |      99 | Law of mutual aid     |     95 | WPRD   |
|        8 | Altruism    |     101 | Altruistically        |     95 | WPRD   |
|        8 | Altruism    |      85 | Altruistic            |     95 | WPRD   |
|        8 | Altruism    |      86 | Altruist              |     95 | WPRD   |
|        8 | Altruism    |      87 | Otherism              |     95 | WPRD   |
|        8 | Altruism    |      88 | Unselfishness         |     95 | WPRD   |
|        8 | Altruism    |      89 | Altruistic behavior   |     95 | WPRD   |
|        8 | Altruism    |      90 | Altutrists            |     95 | WPRD   |
|        8 | Altruism    |      91 | Altruists             |     95 | WPRD   |
|        8 | Altruism    |     102 | Pathological altruism |     95 | WPRD   |
+----------+-------------+---------+-----------------------+--------+--------+

Now, how to create a scoring function that would take the below into consideration for this particular example:

The end result doesn't have to be scaled from 1 to 100, it can be a relative or abstract rating that takes into account these constraints.

I tried by calculating, for each row, ( term_sum_weight * 100 / topic_weight_sum_of_all_terms ) but see the below results, they aren't weighed enough.

Results computed on spreadsheets to try. How can I give a lot more weight for 95 to 96, than 1 to 20, for example?

The formula matters more than the language that will be used... MySQL or in Python/PHP in my programs.

EXPECTED RESULT (along those lines...)

+----------+-------------+---------+-----------------------+-------+--------+
| topic_id | topic_label | term_id | term_label            | score | source |
+----------+-------------+---------+-----------------------+-------+--------+
|        8 | Altruism    |      83 | Altruism              |     1 | TOPIC  |
+----------+-------------+---------+-----------------------+-------+--------+
|        8 | Altruism    |     100 | Altruism (philosophy) |  0.98 | WPRD   |
|        8 | Altruism    |      84 | Truist                |  0.96 | MAN    |
+----------+-------------+---------+-----------------------+--------+-------+
|        8 | Altruism    |      94 | Selfless action       |  0.95 | MAN    |
|        8 | Altruism    |      95 | Alturism              |  0.95 | MAN    |
|        8 | Altruism    |      96 | Digital altruism      |  0.95 | MAN    |
                                ...........
|        8 | Altruism    |      97 | Selflessly            |  0.95 | MAN    |
|        8 | Altruism    |      90 | Altutrists            |  0.95 | MAN    |
|        8 | Altruism    |      91 | Altruists             |  0.95 | MAN    |
|        8 | Altruism    |     102 | Pathological altruism |  0.95 | MAN    |
+----------+-------------+---------+-----------------------+--------+-------+

Upvotes: 0

Views: 448

Answers (1)

MvG
MvG

Reputation: 60958

You should start with working out some properties that your desired formula should have. Some likely assumptions might be the following:

  1. A bagging of weight 100 always results in the maximal score (be it 1 or 100)
  2. Adding a bagging with weight zero changes nothing
  3. Adding a bagging with non-zero weight increases the score of the pair, unless the score is already at maximum
  4. If two pairs have a single bagging each, then greater weight results in greater score

Conditions 2 and 3 suggest something like addition. But condition 1 tells you that you can't just add, as doing so would exceed the score. One way to view this would be using a physical analogon. Think of your weights as speeds. In everyday life, you can simply add speeds. But at very high speeds, special relativity tells us that we can never exceed the limit of light speed. Adding two speeds which are lower than the speed of light does result in a speed which is higher but still lower than the speed of light. In your setup, the “speed of light” is the maximal wight of 100.

So look up the formula for the addition of speeds and adapt it to your use case. If you have two baggings of weight v and w then the total weight would be

(v + w)/(1 + v*w/10000)

Now you need to either work out a formula which does the above for arbitrary number of summands in a single formula, or you write some application code to incrementally compute the sum between an accumulator and the next item from your database. Or you read on in Wikipedia and find that the following equation holds:

c-s   c-v   c-w
--- = --- * ---
c+s   c+v   c+w

Here s is the sum of speeds, or the score in your world. But the larger s becomes, the smaller this fraction will be. So instead of sorting by s in descending order, you may sort by (c-s)/(c+s) in ascending order. Unfortunately MySQL doesn't have a PRODUCT aggregate function. But you can use logarithms to turn products into sums:

SUM(LOG((100-weight)/(100+weight))) AS score

Unfortunately, MySQL doesn't deal with infinities the way IEEE floating point arithmetic does (i.e. log(0) = -∞), but turns a weight of 100 into a score of NULL. So you either sort NULL scores before non-NULL ones using SORT BY (score IS NULL) DESC or similar. Or you turn the above back into the value s in a way which handles zeros correctly.

SELECT …,
  IF(MAX(weight) = 100, 1,
    (1-EXP(SUM(LOG((100-weight)/(100+weight)))))/
    (1+EXP(SUM(LOG((100-weight)/(100+weight)))))) AS score

Here you can think of the EXP(SUM(LOG(…))) as PRODUCT(…). To solve (c-s)/(c+s)=p for s (with p as the product I just mentioned) you'd compute s=c*(1-p)/(1+p). So by leaving out the c* there you obtain a score in the range 0 to 1 instead of 0 to 100, similar to what you did in your calculations so far. The MAX(weight) = 100 case is what would result in a LOG(0) making the whole computation NULL, so that has to be handled separately.

See http://sqlfiddle.com/#!9/1cd56/1 for an example using your data. It returns 1 for Altruism, 0.9987 for Altruism (philosophy), 0.9628 for Truism and 0.95 for the others. The score for the second is considerably larger than what you expected in your question, but I don't have a good idea to tweak the formula to be more like what you expected.

Upvotes: 2

Related Questions