Scarface
Scarface

Reputation: 3923

Need expert advice on complex nested queries

I have 3 queries. I was told that they were potentially inefficient so I was wondering if anyone who is experienced could suggest anything. The logic is somewhat complex so bear with me.

I have two tables: shoutbox, and topic. Topic stores all information on topics that were created, while shoutbox stores all comments pertaining to each topic. Each comment comes with a group labelled by reply_chunk_id. The earliest timestamp is the first comment, while any following with the same reply_chunk_id and a later timestamp are replies. I would like to find the latest comment for each group that was started by the user (made first comment) and if the latest comment was made this month display it.

What I have written achieves that with one problem: all the latest comments are displayed in random order. I would like to organize these groups/latest comments. I really appreciate any advice

Shoutbox

Field        Type
-------------------
id              int(5)           
timestamp       int(11)              
user            varchar(25)      
message         varchar(2000)    
topic_id        varchar(35)     
reply_chunk_id  varchar(35)

Topic

id                    mediumint(8)       
topic_id              varchar(35)            
subject_id            mediumint(8)           
file_name             varchar(35)            
topic_title           varchar(255)           
creator               varchar(25)            
topic_host            varchar(255)           
timestamp             int(11)            
color                 varchar(10)            
mp3                   varchar(75)        
custom_background     varchar(55)            
description           mediumtext                     
content_type          tinyint(1)

Query

$sql="SELECT reply_chunk_id FROM shoutbox 
GROUP BY reply_chunk_id 
HAVING count(*) > 1 
ORDER BY timestamp DESC ";
$stmt16 = $conn->prepare($sql);
$result=$stmt16->execute();
while($row = $stmt16->fetch(PDO::FETCH_ASSOC)){


$sql="SELECT user,reply_chunk_id, MIN(timestamp) AS grp_timestamp
FROM shoutbox WHERE reply_chunk_id=? AND user=?";
$stmt17 = $conn->prepare($sql);
$result=$stmt17->execute(array($row['reply_chunk_id'],$user));
while($row2 = $stmt17->fetch(PDO::FETCH_ASSOC)){


$sql="SELECT t.topic_title, t.content_type, t.subject_id, 
    t.creator, t.description, t.topic_host,
    c1.message, c1.topic_id, c1.user, c1.timestamp AS max
FROM shoutbox c1 
JOIN topic t ON (t.topic_id = c1.topic_id)
WHERE reply_chunk_id = ? AND c1.timestamp > ?
ORDER BY c1.timestamp DESC, c1.id
LIMIT 1";
$stmt18 = $conn->prepare($sql);
$result=$stmt18->execute(array($row2['reply_chunk_id'],$month));
while($row3 = $stmt18->fetch(PDO::FETCH_ASSOC)){

Upvotes: 1

Views: 391

Answers (2)

DRapp
DRapp

Reputation: 48169

Sounds like most of it should be directly from your ShoutBox table. Prequery to find all "Chunks" the user replied to... of those chunks (and topic_ID since each chunk is always the same topic), get their respective minimum and maximum. Using the "Having count(*) > 1" will force only those that HAVE a second posting by a given user (what you were looking for).

THEN, re-query to the chunks to get the minimum regardless of user. This prevents the need of querying ALL chunks. Then join only what a single user is associated with back to the Topic.

Additionally, and I could be incorrect and need to adjust (minimally), but it appears that the SOUNDBOX table ID column would be an auto-increment column, and just happens to be time-stamped too at time of creation. That said, for a given "Chunk", the earliest ID would be the same as the earliest timestamp as they would be stamped at the same time they are created. Also makes easier on subsequent JOINs and sub query too.

By using STRAIGHT_JOIN, should force the "PreQuery" FIRST, come up with a very limited set, then qualify the WHERE clause and joins afterwords.

select STRAIGHT_JOIN
      T.topic_title, 
      T.content_type, 
      T.subject_id, 
      T.creator, 
      T.description, 
      T.topic_host,
      sb2.Topic_ID
      sb2.message, 
      sb2.user,
      sb2.TimeStamp
   from
      ( select
              sb1.Reply_Chunk_ID,
              sb1.Topic_ID,
              count(*) as TotalEntries,
              min( sb1.id ) as FirstIDByChunkByUser,
              min( sbJoin.id ) as FirstIDByChunk,
              max( sbJoin.id ) as LastIDByChunk,
              max( sbJoin.timestamp ) as LastTimeByChunk
           from
              ShoutBox sb1
                 join ShoutBox sbJoin
                    on sb1.Reply_Chunk_ID = sbJoin.Reply_Chunk_ID
           where 
              sb1.user = CurrentUser

           group by
              sb1.Reply_Chunk_ID,
              sb1.Topic_ID

           having 
              min( sb1.id ) = min( sbJoin.ID ) ) PreQuery 

      join Topic T on
         PreQuery.Topic_ID = T.ID

      join ShoutBox sb2
         PreQuery.LastIDByChunk = sb2.ID

   where
      sb2.TimeStamp >= YourTimeStampCriteria

   order by
      sb2.TimeStamp desc

EDIT ---- QUERY EXPLANATION -- with Modified query. I've changed the query from re-reading (as was almost midnight when answered after holiday weekend :)

First, "STRAIGHT_JOIN" is a MySQL clause telling the engine to "do the query in the way / sequence I've stated". Basically, sometimes an engine will try to think for you and optimize in ways that may appear more efficient, but if based on your data, you know what will retrieve the smallest set of data first, and then join to other lookup fields next might in fact be better. Second the "PreQuery". If you have a "SQL-Select" statement (within parens) as Alias "From" clause, The "PreQuery" is just the name of the alias of the resultset... I could have called it anything, just makes sense that this is a stand-alone query of it's own. (Ooops... fixed to ShoutBox :) As for case-sensitivity, typically Column names are NOT case-sensitive... However, table names are... You could have a table name "MyTest" different than "mytest" or "MYTEST". But by supplying "alias", it helps shorten readability (especially with VeryLongTableNamesUsed ).

Should be working after the re-reading and applying adjustments.. Try the first "Prequery" on its own to see how many records it returns. On its own merits, it should return... for a single "CurrentUser" parameter value, every "Reply_Chunk_ID" (which will always have the same topic_id", get the first ID the person entered (min()). By JOINing again to Shoutbox on the chunk id, we (only those qualified as entered by the user), get the minimum and maximum ID per the chunk REGARDLESS of who started or responded. By applying the HAVING clause, this should only return those where the same person STARTED the topic (hence both have the same min() value.)

Finally, once those have been qualified, join directly to the TOPIC and SHOUTBOX tables again on their own merits of topic_id and LastIDByChunk and order the final results by the latest comment response timestamp descending.

I've added a where clause to further limit your "timestamp" criteria where the most recent final timestamp is on/after the given time period you want.

I would be curious how this query's time performance works compared to your already accepted answer too.

Upvotes: 2

Johan
Johan

Reputation: 76723

Make the first query:

SELECT reply_chunk_id FROM shoutbox 
GROUP BY reply_chunk_id 
HAVING count(*) > 1 
ORDER BY timestamp DESC

This does the same, but is faster. Make sure you have an index on reply_chunk_id.

The second query:

SELECT user,reply_chunk_id, MIN(timestamp) AS grp_timestamp
FROM shoutbox WHERE reply_chunk_id=? AND user=?

The GROUP BY is unneeded, because only one row gets returned, because of the MIN() and the equality tests.

The third query:

SELECT t.topic_title, t.content_type, t.subject_id, 
    t.creator, t.description, t.topic_host,
    c1.message, c1.topic_id, c1.user, c1.timestamp AS max
FROM shoutbox c1 
JOIN topic t ON (t.topic_id = c1.topic_id)
WHERE reply_chunk_id = ? AND c1.timestamp > ?
ORDER BY c1.timestamp DESC, c1.id
LIMIT 1

Doing it all in one query:

SELECT 
    t.user,t.reply_chunk_id, MIN(t.timestamp) AS grp_timestamp,
    t.topic_title, t.content_type, t.subject_id, 
    t.creator, t.description, t.topic_host,
    c1.message, c1.topic_id, c1.user, c1.timestamp AS max
FROM shoutbox c1 
INNER JOIN topic t ON (t.topic_id = c1.topic_id)
LEFT JOIN shoutbox c2 ON (c1.id = c2.id and c1.timestamp < c2.timestamp)
WHERE c2.timestamp IS NULL AND t.user = ?
GROUP BY t.reply_chunk_id
HAVING count(*) > 1
ORDER BY t.reply_chunk_id

or the equivalent

SELECT 
    t.user,t.reply_chunk_id, MIN(t.timestamp) AS grp_timestamp,
    t.topic_title, t.content_type, t.subject_id, 
    t.creator, t.description, t.topic_host,
    c1.message, c1.topic_id, c1.user, c1.timestamp AS max
FROM shoutbox c1 
INNER JOIN topic t ON (t.topic_id = c1.topic_id)
WHERE c1.timestamp = (SELECT max(timestamp) FROM shoutbox c2 
                      WHERE c2.reply_chunk_id = c1.reply_chunk_id)
  AND t.user = ?
GROUP BY t.reply_chunk_id
HAVING count(*) > 1
ORDER BY t.reply_chunk_id

How does this work?

  1. The group by selects one entry per topic.reply_chunk_id
  2. The left join (c1.id = c2.id and c1.`timestamp` < c2.`timestamp`) + WHERE c2.`timestamp` IS NULL selects only those items from shoutbox which have the highest timestamp. This works because MySQL keeps increasing c1.timestamp to get c2.timestamp to be null as soon as that is true, it c1.timestamp will have reached its maximum value and will select that row within the possible rows to choose from.

If you don't understand point 2, see: http://dev.mysql.com/doc/refman/5.0/en/example-maximum-column-group-row.html

Note that the PDO is autoescaping the fields with backticks

Upvotes: 3

Related Questions