Reputation: 267177
I'm working on a complex script which could be processing up to 500,000 records. Here's my question.
Basically my code will parse a text file to get each of those 500,000 or so records. Each record has a category, and my code needs to check whether a row in the categories table has already been created for that category during that particular processing run; if not, it creates that row.
So I have 2 options:
1) I store an array of key => value pairs mapping category name to ID, so I could do this:
if (array_key_exists($category, $allCategories)) {
    $id = $allCategories[$category];
} else {
    // Insert the new category and remember its id for the rest of the run.
    mysql_query("INSERT INTO categories (procId, category)
                 VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
    $allCategories[$category] = $id;
}
2) Each time this text file is processed, it gets its own process ID. So rather than checking the $allCategories array, which could grow to 100,000+ entries, I could do this:
SELECT id FROM categories WHERE procId='$procId' AND category='$category'
The downside here is that this query would be run for each of the 500,000+ records, whereas the disadvantage of holding all the categories in an array is that I could run out of memory or the server could crash.
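Roughly, option 2 per record would look something like this (same mysql_* calls as above, with an INSERT fallback for when the category isn't there yet):

// Option 2: one lookup query per record, inserting on a miss.
$result = mysql_query("SELECT id FROM categories
                       WHERE procId='$procId' AND category='$category'");
if ($row = mysql_fetch_assoc($result)) {
    $id = $row['id'];
} else {
    mysql_query("INSERT INTO categories (procId, category)
                 VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
}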
Any thoughts?
Upvotes: 1
Views: 356
Reputation: 126135
Given that your average category name is 30 bytes, you'd only need 30 * 500,000 bytes = 15,000,000 bytes = 15,000 kilobytes = 15 megabytes.
I think you have that much memory.
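If you'd rather measure than estimate, PHP can report its own footprint; something like this around the parsing loop would tell you what the array actually costs (keep in mind PHP arrays add per-entry overhead on top of the raw string sizes):

// Rough measurement of what the category map costs in practice.
$before = memory_get_usage();

// ... parse the file and fill $allCategories here ...

$after = memory_get_usage();
echo "Category map uses about " . ($after - $before) . " bytes\n";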
Upvotes: 1
Reputation: 60518
Can you just keep a list of the ids that you've already inserted? If they are integer ids, at 4 bytes each, 100,000 entries would use only about 400 KB of memory.
ETA:
To avoid storing the category name, hash the name and store the hash. With a 128-bit MD5 hash, that's 16 bytes per hash, or only about 1.6 MB of memory plus overhead.
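A rough sketch of that in PHP, reusing the array from your option 1 (the second argument to md5() asks for the raw 16-byte binary digest):

// Key the array by the 16-byte MD5 digest instead of the full category name.
$key = md5($category, true);   // true = raw binary output, 16 bytes
if (isset($allCategories[$key])) {
    $id = $allCategories[$key];
} else {
    mysql_query("INSERT INTO categories (procId, category)
                 VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
    $allCategories[$key] = $id;
}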
Upvotes: 2
Reputation: 11625
One idea would be to add a unique constraint on the table so duplicate inserts are rejected by the database. Then just keep inserting all the records and let the DB do the checking.
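A rough sketch of that, assuming the id column is AUTO_INCREMENT and using a made-up name for the unique key:

// One-time schema change: let MySQL reject duplicate (procId, category) pairs.
mysql_query("ALTER TABLE categories
             ADD UNIQUE KEY uniq_proc_category (procId, category)");

// Per record: INSERT IGNORE silently skips rows that hit the unique key,
// so no array lookup or prior SELECT is needed.
mysql_query("INSERT IGNORE INTO categories (procId, category)
             VALUES ('$procId', '$category')");

If you also need the id back for every record, MySQL's documented INSERT ... ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id) form makes mysql_insert_id() return the existing row's id even when nothing new is inserted.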
Upvotes: 1