Dom Vinyard
Dom Vinyard

Reputation: 2088

Comment threading with score factoring

I am banging my head against something and I was wondering if somebody more skilled that me could help me out.

My aim is to create a comment thread that factors in a system of comment scoring.

First I'll explain where I am currently.

Say we have a comment thread on an article that looks like the example below. The number in parenthasis is the ID of that comment. ID's are assigned automatically by the database and increment chronologically with each additional comment posted. The number of dashes before the comment text reperesent the comment depth.

(01)"This is a top level comment." 
(02)-"This is a second level comment. A reply to the top level comment above."
(06)-"This is also a second level comment / another reply to comment 01."
(07)--"This is a reply to comment 06."
(03)"This is a different top level comment."
(05)-"This is a reply to the comment above."
(08)--"This is a reply to that comment in turn."
(10)---"This is a deeper comment still."
(04)"This is one more top level comment."
(09)-"This is one more reply."

My first problem was storing this data in a way that means it can be returned in the correct order. If you simply store a depth field and order by depth, it'll bring back all of the top level comments first and then the second level comments etc. This isn't right, we must return the comments with the full parentage still intact.

One way to achieve this is to store the full parentage for each comment.

Comment ID  | Parentage
     01     |              (Comment 01 has no parent because it is top level)
     02     | 01-          (Comment 02 was a reply to comment 01)
     03     | 
     04     |              
     05     | 03-
     06     | 01-
     07     | 01-06-       (Comment 07 has two ancestors 01 and then 06)
     08     | 03-05-
     09     | 04-
     10     | 03-05-08-

Adding another comment record is as simple as grabbing the Parentage from the comment that you are replying to, and appending its ID to form the new parentage. For example, if I was replying to comment 10, I would take it's parentage (03-05-08-) and append its ID (10-). The database would automatically recognise it as the 11th comment, and we'd get:

Comment ID  | Parentage
     01     | 
     02     | 01- 
     03     | 
     04     |              
     05     | 03-
     06     | 01-
     07     | 01-06-
     08     | 03-05-
     09     | 04-
     10     | 03-05-08-
     11     | 03-05-08-10-

Now, when we order the comments for display, we order on a concatenation of Parentage and Comment ID which gives us:

Order by CONCAT(Parentage, ID)

Comment ID  | Parentage    |   CONCAT(Parentage, ID)
     01     |              |   01-
     02     | 01-          |   01-02-
     06     | 01-          |   01-06-
     07     | 01-06-       |   01-06-07-
     03     |              |   03-
     05     | 03-          |   03-05-
     08     | 03-05-       |   03-05-08-
     10     | 03-05-08-    |   03-05-08-10-
     11     | 03-05-08-10- |   03-05-08-10-11-
     04     |              |   04-
     09     | 04-          |   04-09-

Which produces the exact same list as first demonstrated. With Comment 11 which we later added inserted in the correct place:

(01)"This is a top level comment." 
(02)-"This is a reply to the top level comment."
(06)-"This is another reply that was posted later than the first."
(07)--"This is a reply to the second level comment directly above."
(03)"This is a different top level comment."
(05)-"This is a reply to the comment above."
(08)--"This is a reply to the comment above."
(10)---"This is a deeper comment still."
(11)----"THIS COMMENT WAS ADDED IN THE EARLIER EXAMPLE."
(04)"This is one more top level comment."
(09)-"This is one more reply."

Indenting can be done by checking the length of the CONCAT string and multiplying the len(CONCAT(Parentage, ID)) by a set number of pixels. This is great, we have a system of storing comments in a way that recognises their parentage.

Now the problem:

Not all comments are equal. A system of comment scoring is needed to distinguish good comments. Let's say each comment has an upvote button.. while we want to retain parentage, if one comment has two direct replies at the same level then we want to show the one with the most upvotes first. I'll add some votes in [square brackets] below.

(01)"This is a top level comment." [6 votes]
(02)-"This is a reply to the top level comment." [2 votes]
(06)-"This is another reply that was posted later than the first." [30 votes]
(07)--"This is a reply to the second level comment directly above." [5 votes]
(03)"This is a different top level comment." [50 votes]
(05)-"This is a reply to the comment above." [4 votes]
(08)--"This is a reply to the comment above." [0 votes]
(10)---"This is a deeper comment still." [0 votes]
(11)----"THIS COMMENT WAS ADDED IN THE EARLIER EXAMPLE." [0 votes]
(04)"This is one more top level comment." [2 votes]
(09)-"This is one more reply." [0 votes]

In this example, comments (01) and (03) are both top-level but (03) has [50 votes] and (01) only has [6 votes]. (01) appears above only by virtue of the fact that it was posted earlier and therefore has been assigned a smaller ID. Likewise (02) and (06) are both replies to (01) but must be reordered to allow the one with the most votes (06) to rise to the top.

I am completely and utterly stuck in trying to achieve this.

I imagine that any ordering/reordering and indexing would be better done on a comment-vote being cast rather than on page load so that the page-load time can be as quick as possible but beyond that I have absolutely no idea!

Any ideas or light you could shed on possible avenues would really take a load off! Thanks for your help as always.

--------------------------------------------------------------------------------

Edit: In response to @Paddy's solution,

When I run the expression offered by @Paddy below on the mock data, the first error I get is:

"The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified." 

This can be remedied by adding SELECT 'top 100 percent' to the recursive member definition. Once this is done, I get the error:

'CommentTree' has more columns than were specified in the column list.

This can be resolved by adding a 'Level' column to the CommentTree specification. This then prints the data, but it returns all the top level comments first and then something resembling (but not actually matching) the correct sort order after.

The data is returned as such:

ParentId  |  CommentId  |  Comment  |  Vote  | Level
NULL      |      1      | Text here |   6    |  0
NULL      |      3      | Text here |   50   |  0     
NULL      |      4      | Text here |   2    |  0    
4         |      9      | Text here |   0    |  1    
3         |      5      | Text here |   4    |  1    
5         |      8      | Text here |   0    |  2    
8         |      10     | Text here |   0    |  3   
10        |      11     | Text here |   0    |  4    
1         |      2      | Text here |   2    |  1    
1         |      6      | Text here |   30   |  1     
6         |      7      | Text here |   5    |  2    

Have I done anything wrong or did @Paddy miss out something perhaps? Please accept my apologies, recursive functions are very new to me.

Upvotes: 8

Views: 375

Answers (3)

Andrey Gurinov
Andrey Gurinov

Reputation: 2885

Code below looks good for your task. It's a bit complex, but it was a challenge for me to make it in a single SELECT. You can split it into several SELECT with prefetch into temporary tables (for performance purposes), or keep it together.

Thank you for the question, it was interesting!

Please note that ParentID for root nodes must be 0, not NULL.

DECLARE @a TABLE (
    CommentID  INT,
    ParentID INT,
    Comment VARCHAR(100),
    Vote INT
)


INSERT @a
VALUES
    (1, 0, '', 6),
    (3, 0, '', 50),
    (4, 0, '', 2),
    (9, 4, '', 0),
    (5, 3, '', 4),
    (8, 5, '', 0),
    (10, 8, '', 0),
    (11, 10, '', 0),
    (2, 1, '', 2),
    (6, 1, '', 30),
    (7, 6, '', 5)

;WITH CTE_1 (ParentId, CommentId, Comment, Vote, Level, LevelPriority, Path)    -- prepare base info
AS
(
    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote, 0 AS Level, ROW_NUMBER() OVER(ORDER BY c.Vote DESC), CAST('/' + CAST(c.CommentId AS VARCHAR(32)) AS VARCHAR(MAX)) + '/'
    FROM @a AS c
    WHERE ParentId = 0

    UNION ALL

    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote, Level + 1 AS Level, ROW_NUMBER() OVER(ORDER BY c.Vote DESC), d.Path + CAST(c.CommentId AS VARCHAR(32)) + '/'
    FROM @a AS c
    INNER JOIN CTE_1 AS d
        ON c.ParentID = d.CommentID
),
CTE_2 (ParentId, CommentId, Comment, Vote, Level, LevelPriority, ChildCount)    -- count number of children
AS
(
    SELECT p.ParentId, p.CommentId, p.Comment, p.Vote, p.Level, p.LevelPriority, COUNT(*)
    FROM CTE_1 AS p
    INNER JOIN CTE_1 AS c
        ON c.Path LIKE p.Path + '%'
    GROUP BY 
        p.ParentId, p.CommentId, p.Comment, p.Vote, p.Level, p.LevelPriority
),
CTE_3 (ParentId, CommentId, Comment, Vote, Level, LevelPriority, OverAllPriority, ChildCount) -- calculate overall priorities
AS
(
    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote, c.Level, c.LevelPriority, 1 AS OverAllPriority, ChildCount
    FROM CTE_2 AS c
    WHERE Level = 0 AND LevelPriority = 1

    UNION ALL

    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote, c.Level, c.LevelPriority, 
        CASE 
            WHEN c.ParentID = d.CommentID THEN d.OverAllPriority + 1
            ELSE d.OverAllPriority + d.ChildCount
        END,
        c.ChildCount
    FROM CTE_2 AS c
    INNER JOIN CTE_3 AS d
        ON 
            (c.ParentID = d.CommentID AND c.LevelPriority = 1) 
            OR (c.ParentID = d.ParentID AND d.LevelPriority + 1 = c.LevelPriority)
)
SELECT ParentId, CommentId, Comment, Vote
FROM CTE_3
ORDER BY OverAllPriority

In this query I do following:

  1. In CTE_1 I calculate ordering positions within same parent comment (based on votes) and build tree path to collect information about all nodes in the hierarchy.
  2. In CTE_2 I calculate number of descendants belonging to every node +1. Tree path allows to count descendants of all level by one SELECT.
  3. In CTE_3 I calculate overall ordering positions based on 3 simple rules:
    1. The topmost row has position = 1
    2. The upper child node has position = parent_position + 1
    3. The next sibling should go after all descendatns of previous one and has position = prev_sibling_position + prev_sibling_number_of_descendants

EDIT The same solution, but without CTE.

DECLARE @a TABLE (
    CommentID  INT,
    ParentID INT,
    Comment VARCHAR(100),
    Vote INT
)

INSERT @a
VALUES
    (1, 0, '', 6),
    (3, 0, '', 50),
    (4, 0, '', 2),
    (9, 4, '', 0),
    (5, 3, '', 4),
    (8, 5, '', 0),
    (10, 8, '', 0),
    (11, 10, '', 0),
    (2, 1, '', 2),
    (6, 1, '', 30),
    (7, 6, '', 5)


DECLARE @rows INT

DECLARE @temp_table TABLE (
    CommentID  INT,
    ParentID INT,
    Comment VARCHAR(100),
    Vote INT,
    LevelPriority INT, 
    Path VARCHAR(MAX),
    ChildCount INT NULL,
    OverAllPriority INT NULL
)

INSERT @temp_table (CommentID, ParentID, Comment, Vote, LevelPriority, Path)
SELECT CommentID, ParentID, Comment, Vote, ROW_NUMBER() OVER(ORDER BY Vote DESC), '/' + CAST(CommentId AS VARCHAR(32)) + '/'
FROM @a
WHERE ParentID = 0

SELECT @rows = @@ROWCOUNT

WHILE @rows > 0
BEGIN

    INSERT @temp_table (CommentID, ParentID, Comment, Vote, LevelPriority, Path)
    SELECT a.CommentID, a.ParentID, a.Comment, a.Vote, ROW_NUMBER() OVER(PARTITION BY a.ParentID ORDER BY a.Vote DESC), c.Path + CAST(a.CommentId AS VARCHAR(32)) + '/'
    FROM @a AS a
    INNER JOIN @temp_table AS c
        ON a.ParentID = c.CommentID
    WHERE NOT
        a.CommentID IN (SELECT CommentID FROM @temp_table)  

    SELECT @rows = @@ROWCOUNT
END

UPDATE c
SET ChildCount = a.cnt
FROM (
    SELECT p.CommentID, COUNT(*) AS cnt 
    FROM @temp_table AS p
    INNER JOIN @temp_table AS c
        ON c.Path LIKE p.Path + '%'
    GROUP BY 
        p.CommentID
) AS a
INNER JOIN @temp_table AS c
    ON a.CommentID = c.CommentID

UPDATE @temp_table
SET OverAllPriority = 1
WHERE ParentID = 0 AND LevelPriority = 1

SELECT @rows = @@ROWCOUNT

WHILE @rows > 0
BEGIN

    UPDATE c
    SET 
        OverAllPriority = CASE 
            WHEN c.ParentID = p.CommentID THEN p.OverAllPriority + 1
            ELSE p.OverAllPriority + p.ChildCount
        END
    FROM @temp_table AS p
    INNER JOIN @temp_table AS c
        ON (c.ParentID = p.CommentID AND c.LevelPriority = 1) 
            OR (p.ParentID = c.ParentID AND p.LevelPriority + 1 = c.LevelPriority)
    WHERE
        c.OverAllPriority IS NULL  
        AND p.OverAllPriority IS NOT NULL

    SELECT @rows = @@ROWCOUNT
END


SELECT * FROM @temp_table 
ORDER BY OverAllPriority

Upvotes: 4

Mosty Mostacho
Mosty Mostacho

Reputation: 43434

Although not directly related to your question, my advice would be to change to the Nested Set Model. I know it is a lot of rework but sooner or later you'll realise it is the best choice :)

Upvotes: 3

Paddy
Paddy

Reputation: 33867

Using a table definition something like this (self referencing key):

Comment ID  |   Parent ID    |   Comment    |  Vote

You could then use a recursive common table expression (this is in MS SQL), to get your results:

WITH CommentTree (ParentId, CommentId, Comment, Vote)
AS
(
-- Anchor member definition
    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote,
        0 AS Level
    FROM dbo.Comments AS c
    WHERE ParentId IS NULL
    UNION ALL
-- Recursive member definition
    SELECT c.ParentId, c.CommentId, c.Comment, c.Vote,
        Level + 1 AS Level
    FROM dbo.Comments AS c
    INNER JOIN CommentTree AS d
        ON c.ParentID = d.CommentID
    Order by C.Vote
)
SELECT ParentId, CommentId, Comment, Vote FROM CommentTree

CTE reference:

http://msdn.microsoft.com/en-us/library/ms186243.aspx

Upvotes: 0

Related Questions