Reputation: 1387
As the title reads, I have a problem with implementing a related articles algorithm. Let me start by listing the tables from the database:
[articles]
id_article
id_category
name
content
publish_date
is_deleted
[categories]
id_category
id_parent
name
[tags_to_articles]
id_tag
id_article
[tags]
id_tag
name
[articles_to_authors]
id_article
id_author
[authors]
id_author
name
is_deleted
[related_articles]
id_article_left
id_article_right
related_score
Every other table except related_articles has data in it. Now i want to fill related_articles with scores between articles (very important: the table will work as an oriented graph, the score of article A with article B could be different than the score between B and A, see the list). The score is computed like this:
I tried to make a query like this:
SELECT a.id, b.id, a.id_category, a.publish_date,
b.id_category, b.publish_date,
c.id_tag,
e.id_author
FROM `articles` a, articles b,
tags_to_articles c, tags_to_articles d,
articles_to_authors e, articles_to_authors f
WHERE a.id_article <> b.id_article AND
(
(a.id_article=c.id_article and c.id_tag=d.id_tag and d.id_article=b.id_article)
OR
(a.id=e.id_article and e.id_author=f.id_author and f.id_article=b.id_article)
OR
(a.id_category=b.id_category)
)
In theory, this would list every element worth computing for score. However, this takes way too much time and resources.
Is there another way? I'm also open to adjusting the algorithm or the tables if it gets a workable solution. Also worth noting is that the score calculations are done in a cron, of course I don't expect this to be running on every page request.
Upvotes: 6
Views: 768
Reputation: 8993
I used a method in Sql Server
I gave as many as related tags for each article
then I get related articles by match tags, more same tags mean more related
ALTER PROCEDURE [dbo].[GetRelatedArticles]
@ArticleLang int,
@ArticleURI varchar(100),
@Count int = 10
AS
SET NOCOUNT ON
DECLARE @URI dbo.URICountType;
INSERT INTO @URI([URI], [Count])
SELECT TOP (@Count) ArticleTag.ArticleURI, COUNT(ArticleTag.ArticleURI) AS ArticleCount
FROM ArticleTag WITH (NOLOCK)
INNER JOIN ArticleTag AS ArticleTags WITH (NOLOCK)
ON ArticleTags.ArticleURI = @ArticleURI
AND ArticleTag.ArticleURI <> @ArticleURI
AND ArticleTag.ArticleTag = ArticleTags.ArticleTag
GROUP BY ArticleTag.ArticleURI
SELECT Article.ArticleURI, Article.ArticleLang
FROM Article WITH (NOLOCK)
INNER JOIN (
SELECT MIN(ABS(ArticleLang-@ArticleLang)) AS ArticleLangDifference, ArticleURI
FROM Article WITH (NOLOCK)
WHERE ArticleURI IN (SELECT URI FROM @URI)
GROUP BY ArticleURI
) AS ArticleGrounp
ON Article.ArticleURI = ArticleGrounp.ArticleURI
AND ABS(Article.ArticleLang-@ArticleLang) = ArticleGrounp.ArticleLangDifference
INNER JOIN @URI AS URI
ON Article.ArticleURI = URI.URI
ORDER BY URI.Count DESC, Article.ArticleLastUpdate DESC
Upvotes: 0
Reputation: 1957
Your approach has a correct concept, you'll need a Cartesian product of the articles table with itself. This was the best solution I could come up with, but it will need some testing:
INSERT INTO related_articles
SELECT a_left.id_article,a_right.id_article,
IF(a_left.id_category = a_right.id_category,x,0) +
IF( IFNULL(atu1.id_author,0) AND IFNULL(atu2.id_author,0),
IF(atu1.id_author = atu2.id_author,y,0), 0
) +
IF( IFNULL(tta1.id_tag,0) AND IFNULL(tta2.id_tag,0),
IF(tta1.id_tag = tta2.id_tag,z,0), 0
)
-(CURRENT_TIMESTAMP - UNIX_TIMESTAMP(a_right.publish_date)) AS score
FROM
articles a_left join articles a_right ON a_left.id_article<>a_right.id_article
AND aleft.id_article > CHECKPOINT_ID
LEFT OUTER JOIN articles_to_authors atu1 ON atu1.id_article = a_left.id_article
LEFT OUTER JOIN articles_to_authors atu2 ON atu2.id_article = a_right.id_article
LEFT OUTER JOIN tags_to_articles tta1 ON tta1.id_article = a_left.id_article
LEFT OUTER JOIN tags_to_articles tta2 ON tta2.id_article = a_right.id_article
Perhaps you'll need 2 extra LEFT JOINs to handle deleted authors. The key here is the CHECKPOINT_ID parameter which can be used, so that you can perform this procedure incrementally. This will enable you to process new articles. The alternative (although I cannot see the reason) will be adding a condition such as
... ON a_left.id_article<>a_right.id_article AND
NOT EXISTS(SELECT id_article_left FROM
related_articles WHERE id_article_left = a_left.id_article AND
id_article_right = a_right.id_article) ...
Upvotes: 2
Reputation: 1624
I seriously doubt you'd be able to do something like this with a single statement and get any kind of performance. Break it up into pieces. Use temp tables. Use set operations.
-- First, let's list all tables that share a category.
SELECT a1.id_article as 'left_article',
a2.id_article as 'right_article',
1 as 'score'
INTO #tempscore
FROM #articles a1
INNER JOIN #articles a2 ON
a1.id_category = a2.id_category
AND a1.id_article <> a2.id_article
-- Now, let's add up everything that shares an author
INSERT INTO #tempscore (left_article, right_article, score)
SELECT ata1.id_article,
ata2.id_article,
2
FROM #articles_to_authors ata1
INNER JOIN #articles_to_authors ata2 ON
ata1.id_author = ata2.id_author
-- Now, let's add up everything that shares a a tag
INSERT INTO #tempscore (left_article, right_article, score)
SELECT ata1.id_article,
ata2.id_article,
4
FROM #tags_to_articles ata1
INNER JOIN #tags_to_articles ata2 ON
ata1.id_tag = ata2.id_tag
-- We haven't looked at dates, yet, but let's go ahead and consolidate what we know.
SELECT left_article as 'left_article',
right_article as 'right_article',
SUM (score) as 'total_score'
INTO #cscore
FROM #tempscore
GROUP BY left_article,
right_article
-- Clean up some extranneous stuff
DELETE FROM #cscore WHERE left_article = right_article
-- Now we need to deal with dates
SELECT DateDiff (Day, art1.publish_date, art2.publish_date) as 'datescore',
art1.id_article as 'left_article',
art2.publish_date as 'right_article'
INTO #datescore
FROM #cscore
INNER JOIN #articles art1 ON
#cscore.left_article = art1.id_article
INNER JOIN #articles art2 ON
#cscore.right_article = art2.id_article
WHERE art1.publish_date > art2.publish_date
-- And finally, put it all together
INSERT INTO #related_articles (id_article_left, id_article_right, related_score)
SELECT s1.left_article,
s1.right_article,
s1.total_score + IsNull (s2.datescore, 0)
FROM #cscore s1
LEFT JOIN #datescore s2 ON
s1.left_article = s2.left_article
AND s1.right_article = s2.right_article
In my testing, the scores appeared to come out right, but I don't have any real sample data to go off of so I can't be sure. If nothing else, this should give you a basis to start from.
Upvotes: 4