Reputation: 1458
I've got two tables (topics and terms) with a third table for my many-to-many relations between the two entities.
Each relation, called a bagging, has a source (text) and a weight (int between 0 and 100). A same pair (topic-term) can have several bagging (difference sources) and each has a different weight.
Now, when I query a topic to find out what are its best terms (more weight), I ideally want unique values with a calculated weight:
Here is the database schema:
| TOPIC
+-------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| label | varchar(255) | NO | UNI | NULL | |
| wtext | varchar(40) | YES | | NULL | |
+-------+------------------+------+-----+---------+----------------+
| TERM
+-------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| label | varchar(255) | NO | UNI | NULL | |
| slug | varchar(255) | NO | | NULL | |
+-------+---------------------+------+-----+---------+----------------+
| BAGGING
+----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| topic_id | int(11) unsigned | NO | MUL | NULL | |
| term_id | bigint(11) unsigned | NO | MUL | NULL | |
| weight | tinyint(1) unsigned | NO | | NULL | |
| source | varchar(8) | YES | | GEN | |
+----------+---------------------+------+-----+---------+----------------+
And this is my simple query:
SELECT
bagging.topic_id as topic_id,
topic.label as topic_label,
bagging.term_id as term_id,
term.label as term_label,
bagging.weight as weight,
bagging.source as source
FROM
bagging
JOIN term ON term.id = bagging.term_id
JOIN topic ON topic.id = bagging.topic_id
WHERE
bagging.topic_id = ( SELECT id FROM topic WHERE label = 'Altruism' )
ORDER BY
bagging.weight DESC
Which gives me the below results:
+----------+-------------+---------+-----------------------+--------+--------+
| topic_id | topic_label | term_id | term_label | weight | source |
+----------+-------------+---------+-----------------------+--------+--------+
| 8 | Altruism | 83 | Altruism | 100 | TOPIC |
+----------+-------------+---------+-----------------------+--------+--------+
| 8 | Altruism | 100 | Altruism (philosophy) | 95 | WPRD |
| 8 | Altruism | 100 | Altruism (philosophy) | 95 | MAN |
| 8 | Altruism | 84 | Truist | 95 | MAN |
| 8 | Altruism | 84 | Truist | 15 | WPRD |
+----------+-------------+---------+-----------------------+--------+--------+
| 8 | Altruism | 94 | Selfless action | 95 | WPRD |
| 8 | Altruism | 95 | Alturism | 95 | WPRD |
| 8 | Altruism | 96 | Digital altruism | 95 | WPRD |
| 8 | Altruism | 97 | Selflessly | 95 | WPRD |
| 8 | Altruism | 98 | Altruistical | 95 | WPRD |
| 8 | Altruism | 99 | Law of mutual aid | 95 | WPRD |
| 8 | Altruism | 101 | Altruistically | 95 | WPRD |
| 8 | Altruism | 85 | Altruistic | 95 | WPRD |
| 8 | Altruism | 86 | Altruist | 95 | WPRD |
| 8 | Altruism | 87 | Otherism | 95 | WPRD |
| 8 | Altruism | 88 | Unselfishness | 95 | WPRD |
| 8 | Altruism | 89 | Altruistic behavior | 95 | WPRD |
| 8 | Altruism | 90 | Altutrists | 95 | WPRD |
| 8 | Altruism | 91 | Altruists | 95 | WPRD |
| 8 | Altruism | 102 | Pathological altruism | 95 | WPRD |
+----------+-------------+---------+-----------------------+--------+--------+
Now, how to create a scoring function that would take the below into consideration for this particular example:
Altruism
is unbeatable, can only be equalled ( = 100 )Truist
should obviously be penalized by the 15
/100 weight, yet the fact that there are two should also be factored, especially because the second is 95
Altruist (Philosophy)
should weigh more than all the others (except Altruism
than can only be equaled.) even if the 95 twice looks bigger than 100.The end result doesn't have to be scaled from 1 to 100, it can be a relative or abstract rating that takes into account these constraints.
I tried by calculating, for each row, ( term_sum_weight * 100 / topic_weight_sum_of_all_terms )
but see the below results, they aren't weighed enough.
How can I give a lot more weight for 95 to 96, than 1 to 20, for example?
The formula matters more than the language that will be used... MySQL or in Python/PHP in my programs.
EXPECTED RESULT (along those lines...)
+----------+-------------+---------+-----------------------+-------+--------+
| topic_id | topic_label | term_id | term_label | score | source |
+----------+-------------+---------+-----------------------+-------+--------+
| 8 | Altruism | 83 | Altruism | 1 | TOPIC |
+----------+-------------+---------+-----------------------+-------+--------+
| 8 | Altruism | 100 | Altruism (philosophy) | 0.98 | WPRD |
| 8 | Altruism | 84 | Truist | 0.96 | MAN |
+----------+-------------+---------+-----------------------+--------+-------+
| 8 | Altruism | 94 | Selfless action | 0.95 | MAN |
| 8 | Altruism | 95 | Alturism | 0.95 | MAN |
| 8 | Altruism | 96 | Digital altruism | 0.95 | MAN |
...........
| 8 | Altruism | 97 | Selflessly | 0.95 | MAN |
| 8 | Altruism | 90 | Altutrists | 0.95 | MAN |
| 8 | Altruism | 91 | Altruists | 0.95 | MAN |
| 8 | Altruism | 102 | Pathological altruism | 0.95 | MAN |
+----------+-------------+---------+-----------------------+--------+-------+
Upvotes: 0
Views: 448
Reputation: 60958
You should start with working out some properties that your desired formula should have. Some likely assumptions might be the following:
Conditions 2 and 3 suggest something like addition. But condition 1 tells you that you can't just add, as doing so would exceed the score. One way to view this would be using a physical analogon. Think of your weights as speeds. In everyday life, you can simply add speeds. But at very high speeds, special relativity tells us that we can never exceed the limit of light speed. Adding two speeds which are lower than the speed of light does result in a speed which is higher but still lower than the speed of light. In your setup, the “speed of light” is the maximal wight of 100.
So look up the formula for the addition of speeds and adapt it to your use case. If you have two baggings of weight v and w then the total weight would be
(v + w)/(1 + v*w/10000)
Now you need to either work out a formula which does the above for arbitrary number of summands in a single formula, or you write some application code to incrementally compute the sum between an accumulator and the next item from your database. Or you read on in Wikipedia and find that the following equation holds:
c-s c-v c-w
--- = --- * ---
c+s c+v c+w
Here s is the sum of speeds, or the score in your world. But the larger s becomes, the smaller this fraction will be. So instead of sorting by s in descending order, you may sort by (c-s)/(c+s) in ascending order. Unfortunately MySQL doesn't have a PRODUCT
aggregate function. But you can use logarithms to turn products into sums:
SUM(LOG((100-weight)/(100+weight))) AS score
Unfortunately, MySQL doesn't deal with infinities the way IEEE floating point arithmetic does (i.e. log(0) = -∞), but turns a weight of 100
into a score of NULL
. So you either sort NULL
scores before non-NULL
ones using SORT BY (score IS NULL) DESC
or similar. Or you turn the above back into the value s in a way which handles zeros correctly.
SELECT …,
IF(MAX(weight) = 100, 1,
(1-EXP(SUM(LOG((100-weight)/(100+weight)))))/
(1+EXP(SUM(LOG((100-weight)/(100+weight)))))) AS score
Here you can think of the EXP(SUM(LOG(…)))
as PRODUCT(…)
. To solve (c-s)/(c+s)=p
for s
(with p
as the product I just mentioned) you'd compute s=c*(1-p)/(1+p)
. So by leaving out the c*
there you obtain a score in the range 0 to 1 instead of 0 to 100, similar to what you did in your calculations so far. The MAX(weight) = 100
case is what would result in a LOG(0)
making the whole computation NULL
, so that has to be handled separately.
See http://sqlfiddle.com/#!9/1cd56/1 for an example using your data. It returns 1 for Altruism, 0.9987 for Altruism (philosophy), 0.9628 for Truism and 0.95 for the others. The score for the second is considerably larger than what you expected in your question, but I don't have a good idea to tweak the formula to be more like what you expected.
Upvotes: 2