Reputation: 196
I have an unnormalized database with ~400 million rows. I have moved all the repeating data into a new normalized database so I can represent it with IDs.
Now I need to move all the data entries over, transforming each one into an entry containing IDs.
The problem is that with 400M rows it will take a while, and I need help optimizing.
This query takes 0.4 sec per row, so it would take a few months:
INSERT IGNORE INTO normalized.entry (insDate, `date`, project, keyword, url, position, competition, serachEngine)
SELECT
CURDATE() as insDate
, d.id as dateId
, p.id as projectId
, k.id as keywordId
, z.id AS urlId
, old.position
, old.competition
, s.id as searchEngineId
FROM unnormalized.bigtable old
INNER JOIN normalized.`date` d ON old.insDate = d.`date`
INNER JOIN normalized.project p ON old.awrProject = p.project
INNER JOIN normalized.searchEngine s ON old.searchEngine = s.searchEngine
INNER JOIN normalized.keyword k ON old.keyword = k.keyword
INNER JOIN normalized.urlHash z ON old.url = z.url
WHERE old.id >= ".$start." AND old.id <= ".$stop."";
If I use more PHP and divide it into two queries this way, it takes only 0.07 sec per entry, but that still adds up to months:
$q = "SELECT tmp.id
, d.id as dateId
, p.id as projectId
, k.id as keywordId
, tmp.position
, tmp.competition
, s.id as searchEngineId
, tmp.url
, z.id AS urlId
FROM unnormalized.bigtable tmp
INNER JOIN normalized.`date` d ON tmp.insDate = d.`date`
INNER JOIN normalized.project p ON tmp.awrProject = p.project
INNER JOIN normalized.searchEngine s ON tmp.searchEngine = s.searchEngine
INNER JOIN normalized.keyword k ON tmp.keyword = k.keyword
INNER JOIN normalized.urlHash z ON tmp.url = z.url
WHERE tmp.id > ".$start." AND tmp.id < ".$stop."";
// echo $q;
$result = mysql_query($q, $local);
if (mysql_num_rows($result) > 0) {
    while ($row = mysql_fetch_array($result)) {
        // look up the url id, then insert the normalized entry
        $q = "SELECT id FROM normalized.url WHERE url = '".$row["url"]."'";
        $resultUrl = mysql_query($q, $local);
        $rowUrl = mysql_fetch_array($resultUrl);
        $q = "INSERT IGNORE INTO normalized.entry (insDate, `date`, project, keyword, url, position, competition, serachEngine) VALUES (NOW(), '".$row["dateId"]."', '".$row["projectId"]."', '".$row["keywordId"]."', '".$rowUrl["id"]."', '".$row["position"]."', '".$row["competition"]."', '".$row["searchEngineId"]."')";
        mysql_query($q, $local);
    }
}
I don't know how I am going to port this data without it taking half a year! All help is appreciated.
Spec: I am using InnoDB on an Amazon RDS server.
Edit: EXPLAIN SELECT of the first query:
id  select_type  table  type    possible_keys                                             key                  key_len  ref                  rows    Extra
1   SIMPLE       p      index   NULL                                                      projectName          42       NULL                 1346    Using index
1   SIMPLE       s      index   NULL                                                      searchEngine         42       NULL                 2336    Using index; Using join buffer
1   SIMPLE       k      index   NULL                                                      keyword              42       NULL                 128567  Using index; Using join buffer
1   SIMPLE       tmp    ref     keyword_url_insDate,keyword,searchEngine,url,awrProject   keyword_url_insDate  767      func                 115     Using where
1   SIMPLE       d      eq_ref  date                                                      date                 3        intradb.tmp.insDate  1       Using where; Using index
1   SIMPLE       z      ref     url                                                       url                  767      bbointradb.tmp.url   1       Using index
SHOW CREATE TABLE:
'rankingUrls201001', 'CREATE TABLE `rankingUrls201001` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `insDate` datetime NOT NULL,
  `keyword` varchar(255) COLLATE utf8_swedish_ci NOT NULL,
  `searchEngine` varchar(25) COLLATE utf8_swedish_ci NOT NULL,
  `url` varchar(255) COLLATE utf8_swedish_ci NOT NULL,
  `position` int(11) NOT NULL,
  `competition` varchar(20) COLLATE utf8_swedish_ci NOT NULL,
  `awrProject` varchar(200) COLLATE utf8_swedish_ci NOT NULL,
  `server` varchar(20) COLLATE utf8_swedish_ci NOT NULL,
  `rank` varchar(40) COLLATE utf8_swedish_ci NOT NULL,
  PRIMARY KEY (`id`),
  KEY `keyword_url_insDate` (`keyword`,`url`,`insDate`),
  KEY `keyword` (`keyword`),
  KEY `searchEngine` (`searchEngine`),
  KEY `url` (`url`),
  KEY `awrProject` (`awrProject`)
) ENGINE=InnoDB AUTO_INCREMENT=2266575 DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci'
Upvotes: 0
Views: 402
Reputation: 476
@Robin, in order to speed up the data transfer, you should take steps to ensure all processing is done in memory, and you should avoid writes to disk as much as you possibly can. If you follow the steps below, the processing time should be reduced:
First, configure your DB instance to have as much temporary workspace allocated as possible.
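In MySQL terms this means sizing the session buffers so joins and any implicit temporary tables stay in memory. A minimal sketch, with illustrative values that are assumptions rather than tuned numbers (on RDS the same variables can also be set globally through a DB parameter group):
SET SESSION tmp_table_size      = 256 * 1024 * 1024;  -- in-memory temp table limit
SET SESSION max_heap_table_size = 256 * 1024 * 1024;  -- must be raised together with tmp_table_size
SET SESSION sort_buffer_size    = 8 * 1024 * 1024;    -- per-sort buffer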
Second, split the work into multiple phases, so that each phase completes with as little write-to-disk activity as possible.
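For example, a two-phase sketch using the tables from your question (the entry_stage table and the 100,000-row slice are assumptions, not part of the question): resolve the IDs for one primary-key slice into an in-memory staging table, then flush that slice to the target table in a single bulk write:
-- Phase 1: resolve IDs for one PK slice into a MEMORY staging table.
CREATE TABLE normalized.entry_stage ENGINE=MEMORY AS
SELECT CURDATE() AS insDate, d.id AS `date`, p.id AS project, k.id AS keyword,
       z.id AS url, old.position, old.competition, s.id AS serachEngine
FROM unnormalized.bigtable old
INNER JOIN normalized.`date` d ON old.insDate = d.`date`
INNER JOIN normalized.project p ON old.awrProject = p.project
INNER JOIN normalized.searchEngine s ON old.searchEngine = s.searchEngine
INNER JOIN normalized.keyword k ON old.keyword = k.keyword
INNER JOIN normalized.urlHash z ON old.url = z.url
WHERE old.id BETWEEN 1 AND 100000;  -- one slice; advance the range per batch

-- Phase 2: flush the slice to the target table in one write, then reset.
INSERT IGNORE INTO normalized.entry
    (insDate, `date`, project, keyword, url, position, competition, serachEngine)
SELECT * FROM normalized.entry_stage;
DROP TABLE normalized.entry_stage;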
Finally, disable the binary log; this cuts write-to-disk activity, which in many cases can reduce query response times by as much as half.
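The session-level switch looks like this, though note it requires the SUPER privilege, which RDS MySQL typically does not grant; on RDS the practical alternative is disabling binary logging at the instance level (e.g. by setting the backup retention period to 0):
SET sql_log_bin = 0;  -- this session's writes skip the binary log
-- ... run the batched INSERT ... SELECT statements here ...
SET sql_log_bin = 1;  -- restore logging afterwards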
These steps should speed up the process, but given the number of rows you have, the transfer may still take a significant amount of time.
Upvotes: 0