Valentin Brasso
Valentin Brasso

Reputation: 1387

How to implement a related articles algorithm using this form of collaborative filtering

As the title reads, I have a problem with implementing a related articles algorithm. Let me start by listing the tables from the database:

[articles]
id_article
id_category
name
content
publish_date
is_deleted

[categories]
id_category
id_parent
name

[tags_to_articles]
id_tag
id_article

[tags]
id_tag
name

[articles_to_authors]
id_article
id_author

[authors]
id_author
name
is_deleted

[related_articles]
id_article_left
id_article_right
related_score

The algorithm

Every other table except related_articles has data in it. Now i want to fill related_articles with scores between articles (very important: the table will work as an oriented graph, the score of article A with article B could be different than the score between B and A, see the list). The score is computed like this:

My first (inefficient) approach

I tried to make a query like this:

SELECT a.id, b.id, a.id_category, a.publish_date,
    b.id_category, b.publish_date,
    c.id_tag,
    e.id_author
FROM `articles` a, articles b, 
        tags_to_articles c, tags_to_articles d,
        articles_to_authors e, articles_to_authors f
WHERE a.id_article <> b.id_article AND 
(
    (a.id_article=c.id_article and c.id_tag=d.id_tag and d.id_article=b.id_article)
    OR
    (a.id=e.id_article and e.id_author=f.id_author and f.id_article=b.id_article)
    OR
    (a.id_category=b.id_category)
)

In theory, this would list every element worth computing for score. However, this takes way too much time and resources.

Is there another way? I'm also open to adjusting the algorithm or the tables if it gets a workable solution. Also worth noting is that the score calculations are done in a cron, of course I don't expect this to be running on every page request.

Upvotes: 6

Views: 768

Answers (3)

Eric Yin
Eric Yin

Reputation: 8993

I used a method in Sql Server

I gave as many as related tags for each article

then I get related articles by match tags, more same tags mean more related

ALTER PROCEDURE [dbo].[GetRelatedArticles]
    @ArticleLang int,
    @ArticleURI varchar(100),
    @Count int = 10
AS

SET NOCOUNT ON

DECLARE @URI dbo.URICountType;

INSERT INTO @URI([URI], [Count])
SELECT TOP (@Count) ArticleTag.ArticleURI, COUNT(ArticleTag.ArticleURI) AS ArticleCount
FROM ArticleTag WITH (NOLOCK)
INNER JOIN ArticleTag AS ArticleTags WITH (NOLOCK)
ON ArticleTags.ArticleURI = @ArticleURI
AND ArticleTag.ArticleURI <> @ArticleURI
AND ArticleTag.ArticleTag = ArticleTags.ArticleTag
GROUP BY ArticleTag.ArticleURI

SELECT Article.ArticleURI, Article.ArticleLang
FROM Article WITH (NOLOCK)

INNER JOIN (
    SELECT MIN(ABS(ArticleLang-@ArticleLang)) AS ArticleLangDifference, ArticleURI
    FROM Article WITH (NOLOCK)
    WHERE ArticleURI IN (SELECT URI FROM @URI)
    GROUP BY ArticleURI
) AS ArticleGrounp
ON Article.ArticleURI = ArticleGrounp.ArticleURI
AND ABS(Article.ArticleLang-@ArticleLang) = ArticleGrounp.ArticleLangDifference

INNER JOIN @URI AS URI
ON Article.ArticleURI = URI.URI

ORDER BY URI.Count DESC, Article.ArticleLastUpdate DESC

Upvotes: 0

georgepsarakis
georgepsarakis

Reputation: 1957

Your approach has a correct concept, you'll need a Cartesian product of the articles table with itself. This was the best solution I could come up with, but it will need some testing:

INSERT INTO related_articles 
SELECT a_left.id_article,a_right.id_article,
       IF(a_left.id_category = a_right.id_category,x,0) + 
       IF( IFNULL(atu1.id_author,0) AND IFNULL(atu2.id_author,0), 
           IF(atu1.id_author = atu2.id_author,y,0), 0 
       ) +
       IF( IFNULL(tta1.id_tag,0) AND IFNULL(tta2.id_tag,0), 
           IF(tta1.id_tag = tta2.id_tag,z,0), 0 
       ) 
       -(CURRENT_TIMESTAMP - UNIX_TIMESTAMP(a_right.publish_date)) AS score
FROM
articles a_left join articles a_right ON a_left.id_article<>a_right.id_article 
AND aleft.id_article > CHECKPOINT_ID
LEFT OUTER JOIN articles_to_authors atu1 ON atu1.id_article  = a_left.id_article
LEFT OUTER JOIN articles_to_authors atu2 ON atu2.id_article = a_right.id_article
LEFT OUTER JOIN tags_to_articles tta1 ON tta1.id_article = a_left.id_article
LEFT OUTER JOIN tags_to_articles tta2 ON tta2.id_article = a_right.id_article

Perhaps you'll need 2 extra LEFT JOINs to handle deleted authors. The key here is the CHECKPOINT_ID parameter which can be used, so that you can perform this procedure incrementally. This will enable you to process new articles. The alternative (although I cannot see the reason) will be adding a condition such as

... ON a_left.id_article<>a_right.id_article AND 
    NOT EXISTS(SELECT id_article_left FROM
    related_articles WHERE id_article_left = a_left.id_article AND 
    id_article_right = a_right.id_article) ...

Upvotes: 2

I seriously doubt you'd be able to do something like this with a single statement and get any kind of performance. Break it up into pieces. Use temp tables. Use set operations.

-- First, let's list all tables that share a category.
SELECT   a1.id_article as 'left_article',
         a2.id_article as 'right_article',
         1 as 'score'
INTO     #tempscore
FROM     #articles a1
   INNER JOIN #articles a2 ON
         a1.id_category = a2.id_category
     AND a1.id_article <> a2.id_article

-- Now, let's add up everything that shares an author
INSERT INTO #tempscore (left_article, right_article, score)
SELECT   ata1.id_article,
         ata2.id_article,
         2
FROM     #articles_to_authors ata1
   INNER JOIN #articles_to_authors ata2 ON
         ata1.id_author = ata2.id_author

-- Now, let's add up everything that shares a a tag
INSERT INTO #tempscore (left_article, right_article, score)
SELECT   ata1.id_article,
         ata2.id_article,
         4
FROM     #tags_to_articles ata1
   INNER JOIN #tags_to_articles ata2 ON
         ata1.id_tag = ata2.id_tag

-- We haven't looked at dates, yet, but let's go ahead and consolidate what we know.
SELECT   left_article as 'left_article',
         right_article as 'right_article',
         SUM (score) as 'total_score'
INTO     #cscore
FROM     #tempscore
GROUP BY left_article,
         right_article

-- Clean up some extranneous stuff
DELETE FROM #cscore WHERE left_article = right_article

-- Now we need to deal with dates
SELECT   DateDiff (Day, art1.publish_date, art2.publish_date) as 'datescore',
         art1.id_article as 'left_article',
         art2.publish_date as 'right_article'
INTO     #datescore
FROM     #cscore
   INNER JOIN #articles art1 ON
         #cscore.left_article = art1.id_article
   INNER JOIN #articles art2 ON
         #cscore.right_article = art2.id_article
WHERE    art1.publish_date > art2.publish_date

-- And finally, put it all together
INSERT INTO #related_articles (id_article_left, id_article_right, related_score)
SELECT   s1.left_article,
         s1.right_article,
         s1.total_score + IsNull (s2.datescore, 0)
FROM     #cscore s1
   LEFT  JOIN #datescore s2 ON
         s1.left_article = s2.left_article
     AND s1.right_article = s2.right_article

In my testing, the scores appeared to come out right, but I don't have any real sample data to go off of so I can't be sure. If nothing else, this should give you a basis to start from.

Upvotes: 4

Related Questions