Reputation: 161
I am requesting a PHP solution for the following problem:
I have a ~15 tables in a database, each with 10-50 million rows, all summing up to 200 million rows, with columns userID, B, C, D.
I have 9 other tables with columns userID, fbID. Each table has ~2 million rows. There is a one to one mapping from userID to fbID.
My goal is to output a file of these 200 million rows with columns fbID, B, C, D.
In order to do this, I must search through all the 9 tables that contain columns userID and fbID because the userID may be found in one table, but not the others. I can stop as soon as I find the userID in any one of these tables. This part I am using SQL along with PHP. The SQL query includes LIMIT 1 so I only return 1 row whenever I find the userID, as these tables can contain multiple rows with the same userID.
This algorithm unfortunately takes ~60s/1k rows, which will take me ~130 days to finish.
Is there a more efficient way to do this?
I'm not an expert on how database computation times work, but some ideas I've thought of:
-query through all 9 tables and make a lookup table with userID keys and fBID values.
-create a new table inside the database using these 9 tables with one row per userID, along with the corresponding FBID and search through this one.
Here's more specific info on the tables:
Tables summing up to 200 million rows (each one looks like this):
Column Type Null Default
dtLogTime datetime Yes NULL
iUin int(10) No
B int(10) No
C int(10) No
D int(10) No
Indexes:
Keyname Type Unique Packed Column Cardinality Collation Null Comment
dtLogTime BTREE No No dtLogTime 323542 A YES
iUin BTREE No No iUin 323542 A
One of the 9 other tables:
Column Type Null Default Comments
dtLogTime datetime Yes NULL
iUin int(10) No
vFBID varchar(48) No
Indexes:
Keyname Type Unique Packed Column Cardinality Collation Null Comment
dtLogTime BTREE No No dtLogTime 2179789 A YES
iUin BTREE No No iUin 2179789 A
Sample Code I've tried:
// returns FBID of iuin
function getFBID($iuin){
$query = sprintf("SELECT vFBID FROM `tbReg` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
mysql_free_result($result);
$query = sprintf("SELECT vFBID FROM `tbOnline` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
mysql_free_result($result);
$query = sprintf("SELECT vFBID FROM `tbConsumeFBC` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
mysql_free_result($result);
$query = sprintf("SELECT vFBID FROM `tbFeed` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
mysql_free_result($result);
$query = sprintf("SELECT vFBID FROM `tbInvite` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
mysql_free_result($result);
$query = sprintf("SELECT vFBID FROM `tbFreeGift` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
mysql_free_result($result);
$query = sprintf("SELECT vFBID FROM `tbUninstall` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
mysql_free_result($result);
$query = sprintf("SELECT vFBID FROM `tbDownload` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
$query = sprintf("SELECT vFBID FROM `tbIUserSource` WHERE iuin = " . $iuin . " LIMIT 1");
$result = mysql_query($query);
if(mysql_num_rows($result) != 0){
$row = mysql_fetch_assoc($result);
return $row['vFBID'];
}
mysql_free_result($result);
}
fwrite($handle, '"Time","FBID","Action","ActionID"' . "\n");
$query = sprintf("SELECT count(dtLogTime) AS length
FROM `tbActionWeeding`");
$result = mysql_query($query);
$row = mysql_fetch_assoc($result);
$length = ceil($row['length'] * 0.0001);
$start = 0;
$i = 0;
while($i++ < 10000)
$query = sprintf("SELECT dtLogTime, iuin, iWeedID
FROM `tbActionWeeding`
LIMIT " . $start . "," . $length);
$result = mysql_query($query);
if (!$result) {
$message = 'Invalid query: ' . mysql_error() . "\n";
$message .= 'Whole query: ' . $query . "\n";
die($message);
}
while($row = mysql_fetch_assoc($result))
fwrite($handle, '"' . $row['dtLogTime'] . '","' . getFBID($row['iuin']) .
'","0","' . $row['iWeedID'] . "\"\n");
mysql_free_result($result);
$start += $length;
}
Upvotes: 2
Views: 407
Reputation: 4827
Just understand that the most efficient method possible may still take some time just because of the row count.
The first real problem is that you need this to be in PHP. How absolute is that? If it is at all possible to work on the database itself, you want to do this:
--
-- Index all 9 tables on userid,fbId
select UserId,fbId
into WorkingTable_UserId_to_fbId
from table1Of9
union all
select UserId,fbId
from table_2_of_9
--
-- repeat the UNION all clause up to:
UNION ALL
select UserId,fbId
from table_9_of_9
GROUP BY 1,2
-- Index resulting table on userId,fbId
This gives you a working table that makes the basic query this:
select Linker.Fbid,main.b,main.c.,main.d
from mainTable main
JOIN WorkingTable_UserId_to_FbId linker on main.userId = linker.userId
If it is absolutely not possible to create that table, then you must take the same code and insert it into the query above, and it just ain't gonna go that fast. It would be:
select Linker.Fbid,main.b,main.c.,main.d
from mainTable main
JOIN ( select UserId,fbId
from table1Of9
union all
select UserId,fbId
from table_2_of_9
-- etc, etc.
) Linker on main.userId = linker.userId
However, this will likely stall out as the server tries to gather 200 million rows to get ready to return to PHP. So you need to break it into chunks, fetching perhaps 10000 rows at a shot. It may be tempting to add OFFSET...LIMIT to the above query, but that still puts a heavy load on the server. it is better to handle that in PHP, something like:
# Very sloppy code off the top of my head,
# modify this loop based on what you know of the
# userId values
$id = 1;
while($id <= 200000000) {
$topId = $id + 9999;
$sql="select Linker.Fbid,main.b,main.c.,main.d
from mainTable main
JOIN WorkingTable_UserId_to_FbId linker on main.userId = linker.userId
WHERE main.userId between $id and $topId";
# Note: don't freak out about SQL injection in the above code,
# you are hardcoding the values of ID, not getting them from a user
#
# Execute query, retrieve rows, output
# then up the counter:
$id+=1000;
}
Upvotes: 0
Reputation: 28936
I have 9 other tables with columns userID, fbID
and
these other 9 tables each have ~2 million rows
The inefficiency of this data structure cannot easily be surmounted using clever code alone. Because you are required to process huge amounts of redundant data, the most efficient algorithm will run slowly against this architecture.
What you need is normalization. You should alter the structure of your tables to remove redundant data. This will eliminate the need to search nine separate tables 200 million times, providing a considerable improvement in efficiency.
Upvotes: 5
Reputation: 62395
Now this could work, although just like others said in comments, it would be good to know if you have proper indexes.
SELECT
u.fbID, t.B, t.C, t.d
FROM
veryLargeTable AS t
CROSS JOIN (
SELECT userId, fbID FROM
smallerTable1
UNION SELECT userId, fbID FROM
smallerTable2
...
UNION SELECT userId, fbID FROM
smallerTable9
) AS u USING (userId)
You might want to run it on smaller dataset first to see how it performs.
Upvotes: 0