Mark

Reputation: 161

Need efficient way to do simple computation on 200 million row database (PHP)

I am requesting a PHP solution for the following problem:

I have ~15 tables in a database, each with 10-50 million rows, summing up to 200 million rows in total, with columns userID, B, C, D.

I have 9 other tables with columns userID, fbID. Each table has ~2 million rows. There is a one-to-one mapping from userID to fbID.

My goal is to output a file of these 200 million rows with columns fbID, B, C, D.

In order to do this, I must search through all 9 tables that contain columns userID and fbID, because a given userID may appear in one table but not the others. I can stop as soon as I find the userID in any one of them. For this part I am using SQL from PHP. The SQL query includes LIMIT 1 so that only one row is returned when the userID is found, since these tables can contain multiple rows with the same userID.

This algorithm unfortunately takes ~60s/1k rows, which will take me ~130 days to finish.

Is there a more efficient way to do this?

I'm not an expert on how database computation times work, but some ideas I've thought of:

-query through all 9 tables once and build an in-memory lookup table with userID keys and fbID values.

-create a new table inside the database from these 9 tables, with one row per userID and its corresponding fbID, and search through that one instead.
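The second idea above (one consolidated, deduplicated, indexed mapping table, joined once against the big table instead of probing 9 tables per row) can be sketched as follows. This is an illustration only, using Python's sqlite3 in place of the real MySQL database; the table and column names here are made up:

```python
import sqlite3

# Sketch of idea 2: consolidate several userID->fbID tables into one
# deduplicated, indexed lookup table, then JOIN against it once.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Two stand-ins for the nine mapping tables (duplicate rows included).
cur.execute("CREATE TABLE map1 (userID INTEGER, fbID TEXT)")
cur.execute("CREATE TABLE map2 (userID INTEGER, fbID TEXT)")
cur.executemany("INSERT INTO map1 VALUES (?, ?)",
                [(1, "fb1"), (1, "fb1"), (2, "fb2")])
cur.executemany("INSERT INTO map2 VALUES (?, ?)",
                [(2, "fb2"), (3, "fb3")])

# A stand-in for one of the large fact tables.
cur.execute("CREATE TABLE facts (userID INTEGER, B INTEGER, C INTEGER, D INTEGER)")
cur.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)",
                [(1, 10, 11, 12), (2, 20, 21, 22), (3, 30, 31, 32)])

# Build the consolidated lookup once; UNION (not UNION ALL) removes duplicates.
cur.execute("""
    CREATE TABLE lookup AS
    SELECT userID, fbID FROM map1
    UNION
    SELECT userID, fbID FROM map2
""")
cur.execute("CREATE UNIQUE INDEX idx_lookup ON lookup(userID)")

# The export is now a single indexed JOIN instead of up to 9 probes per row.
rows = cur.execute("""
    SELECT l.fbID, f.B, f.C, f.D
    FROM facts f JOIN lookup l ON f.userID = l.userID
    ORDER BY f.userID
""").fetchall()
print(rows)  # [('fb1', 10, 11, 12), ('fb2', 20, 21, 22), ('fb3', 30, 31, 32)]
```

The same shape in MySQL would be a `CREATE TABLE ... AS SELECT` over the 9 tables plus an index on userID.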

Here's more specific info on the tables:

Tables summing up to 200 million rows (each one looks like this):

Column     Type      Null  Default
dtLogTime  datetime  Yes   NULL
iUin       int(10)   No
B          int(10)   No
C          int(10)   No
D          int(10)   No

Indexes:

Keyname    Type   Unique  Packed  Column     Cardinality  Collation  Null
dtLogTime  BTREE  No      No      dtLogTime  323542       A          YES
iUin       BTREE  No      No      iUin       323542       A

One of the 9 other tables:

Column     Type         Null  Default  Comments
dtLogTime  datetime     Yes   NULL
iUin       int(10)      No
vFBID      varchar(48)  No

Indexes:

Keyname    Type   Unique  Packed  Column     Cardinality  Collation  Null
dtLogTime  BTREE  No      No      dtLogTime  2179789      A          YES
iUin       BTREE  No      No      iUin       2179789      A

Sample Code I've tried:

// Returns the vFBID for an iuin, or NULL if it is not found in any table.
function getFBID($iuin){
    // Probe each mapping table in turn; stop at the first match.
    $tables = array('tbReg', 'tbOnline', 'tbConsumeFBC', 'tbFeed',
                    'tbInvite', 'tbFreeGift', 'tbUninstall',
                    'tbDownload', 'tbIUserSource');
    foreach ($tables as $table) {
        $query = "SELECT vFBID FROM `" . $table . "` WHERE iuin = " . (int)$iuin . " LIMIT 1";
        $result = mysql_query($query);
        if (mysql_num_rows($result) != 0) {
            $row = mysql_fetch_assoc($result);
            mysql_free_result($result);
            return $row['vFBID'];
        }
        mysql_free_result($result);
    }
    return NULL;
}

fwrite($handle, '"Time","FBID","Action","ActionID"' . "\n");

// Export in 10,000 chunks so the whole table is never loaded at once.
$query = "SELECT COUNT(*) AS length FROM `tbActionWeeding`";
$result = mysql_query($query);
$row = mysql_fetch_assoc($result);
mysql_free_result($result);
$length = ceil($row['length'] * 0.0001);   // rows per chunk
$start = 0;
$i = 0;
while ($i++ < 10000) {
    $query = "SELECT dtLogTime, iuin, iWeedID
              FROM `tbActionWeeding`
              LIMIT " . $start . "," . $length;
    $result = mysql_query($query);
    if (!$result) {
        $message  = 'Invalid query: ' . mysql_error() . "\n";
        $message .= 'Whole query: ' . $query . "\n";
        die($message);
    }
    while ($row = mysql_fetch_assoc($result)) {
        fwrite($handle, '"' . $row['dtLogTime'] . '","' . getFBID($row['iuin']) .
                        '","0","' . $row['iWeedID'] . "\"\n");
    }
    mysql_free_result($result);
    $start += $length;
}

Upvotes: 2

Views: 407

Answers (3)

Ken Downs

Reputation: 4827

Just understand that the most efficient method possible may still take some time just because of the row count.

The first real problem is that you need this to be in PHP. How absolute is that? If it is at all possible to work on the database itself, you want to do this:

--
-- Index all 9 tables on (userId, fbId)

CREATE TABLE WorkingTable_UserId_to_fbId AS
SELECT userId, fbId
  FROM table1Of9
UNION
SELECT userId, fbId
  FROM table_2_of_9
--
-- repeat the UNION clause up to:
UNION
SELECT userId, fbId
  FROM table_9_of_9;

-- UNION (rather than UNION ALL) removes duplicate (userId, fbId) pairs.
-- Index the resulting table on (userId, fbId)

This gives you a working table that makes the basic query this:

select linker.fbId, main.b, main.c, main.d
  from mainTable main
  JOIN WorkingTable_UserId_to_fbId linker on main.userId = linker.userId

If it is absolutely not possible to create that table, then you must take the same code and insert it into the query above, and it just ain't gonna go that fast. It would be:

select linker.fbId, main.b, main.c, main.d
  from mainTable main
  JOIN (  select userId, fbId
            from table1Of9
          union
          select userId, fbId
            from table_2_of_9
          -- etc, etc.

       ) linker on main.userId = linker.userId

However, this will likely stall out as the server tries to gather 200 million rows to return to PHP. So you need to break it into chunks, fetching perhaps 10,000 rows at a shot. It may be tempting to add OFFSET...LIMIT to the above query, but that still puts a heavy load on the server. It is better to handle that in PHP, something like:

# Very sloppy code off the top of my head,
# modify this loop based on what you know of the
# userId values
$id = 1;
while($id <= 200000000) {
    $topId = $id + 9999;
    $sql="select linker.fbId, main.b, main.c, main.d
            from mainTable main
            JOIN WorkingTable_UserId_to_FbId linker on main.userId = linker.userId
           WHERE main.userId between $id and $topId";

    # Note: don't freak out about SQL injection in the above code,
    #       you are hardcoding the values of ID, not getting them from a user

    #
    # Execute query, retrieve rows, output
    # then up the counter:
    $id += 10000;
}

Upvotes: 0

George Cummins

Reputation: 28936

I have 9 other tables with columns userID, fbID

and

these other 9 tables each have ~2 million rows

The inefficiency of this data structure cannot easily be surmounted using clever code alone. Because you are required to process huge amounts of redundant data, the most efficient algorithm will run slowly against this architecture.

What you need is normalization. You should alter the structure of your tables to remove redundant data. This will eliminate the need to search nine separate tables 200 million times, providing a considerable improvement in efficiency.
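A minimal illustration of what that normalization buys, sketched with Python's sqlite3 purely for demonstration (the names are hypothetical; the real schema is MySQL). A uniqueness constraint collapses the redundant userID-to-fbID rows so each mapping is stored exactly once:

```python
import sqlite3

# Demonstrates normalizing a redundant userID->fbID mapping:
# a UNIQUE constraint guarantees one row per userID.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# The redundant, denormalized mapping as it exists today.
cur.execute("CREATE TABLE raw_map (userID INTEGER, fbID TEXT)")
cur.executemany("INSERT INTO raw_map VALUES (?, ?)",
                [(1, "fb1"), (1, "fb1"), (1, "fb1"), (2, "fb2")])

# Normalized table: duplicates are silently dropped on insert.
cur.execute("CREATE TABLE user_map (userID INTEGER UNIQUE, fbID TEXT)")
cur.execute("INSERT OR IGNORE INTO user_map SELECT userID, fbID FROM raw_map")

raw_count = cur.execute("SELECT COUNT(*) FROM raw_map").fetchone()[0]
norm_count = cur.execute("SELECT COUNT(*) FROM user_map").fetchone()[0]
print(raw_count, norm_count)  # 4 2
```

With one row per userID and an index backing the unique constraint, each lookup is a single indexed probe instead of a scan across nine tables.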

Upvotes: 5

Mchl

Reputation: 62395

Now this could work, although just like others said in comments, it would be good to know if you have proper indexes.

SELECT
  u.fbID, t.B, t.C, t.D
FROM
  veryLargeTable AS t
JOIN (
  SELECT userID, fbID FROM smallerTable1
  UNION SELECT userID, fbID FROM smallerTable2
  ...
  UNION SELECT userID, fbID FROM smallerTable9
) AS u USING (userID)

You might want to run it on smaller dataset first to see how it performs.

Upvotes: 0
