Reputation: 267177
I'm working on a complex script which could be processing up to 500,000 records. Here's my question.
Basically my code will parse a text file to get each of those 500,000 or so records. Each record has a category, and my code needs to check whether a row in the categories table has already been created for that category during that particular processing run; if not, it creates that row.
So I have 2 options:
1) I store an array of key => value pairs mapping category name to ID, so I could do this:
if (array_key_exists($category, $allCategories)) {
    $id = $allCategories[$category];
} else {
    // Insert the new category and remember its id for the rest of the run.
    mysql_query("INSERT INTO categories (procId, category)
                 VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
    $allCategories[$category] = $id;
}
2) Each time this text file is processed, it gets its own process ID. So rather than checking the $allCategories array, which could grow to 100,000+ entries, I could do this:
SELECT id FROM categories WHERE procId='$procId' AND category='$category'
The downside here is that this query would be run for each of the 500,000+ records, whereas the disadvantage of holding all the categories in an array is that I could run out of memory or the server could crash.
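Roughly, option 2 per record would look something like this (same mysql_* calls as above, with an INSERT fallback for when the category isn't there yet):

// Option 2: one lookup query per record, inserting on a miss.
$result = mysql_query("SELECT id FROM categories
                       WHERE procId='$procId' AND category='$category'");
if ($row = mysql_fetch_assoc($result)) {
    $id = $row['id'];
} else {
    mysql_query("INSERT INTO categories (procId, category)
                 VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
}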
Any thoughts?
Upvotes: 1
Views: 356
Reputation: 126135
Given that your average category name is 30 bytes, you'd only need 30 * 500,000 bytes = 15,000,000 bytes = 15,000 kilobytes = 15 megabytes.
I think you have that much memory.
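If you'd rather measure than estimate, PHP can report its own footprint; something like this around the parsing loop would tell you what the array actually costs (keep in mind PHP arrays add per-entry overhead on top of the raw string sizes):

// Rough measurement of what the category map costs in practice.
$before = memory_get_usage();

// ... parse the file and fill $allCategories here ...

$after = memory_get_usage();
echo "Category map uses about " . ($after - $before) . " bytes\n";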
Upvotes: 1
Reputation: 60518
Can you just keep a list of the ids that you've already inserted? If they are integer ids, at 4 bytes each, 100,000 entries would use only about 400 KB of memory.
ETA:
To avoid storing the category name, hash the name and store the hash. With a 128-bit MD5 hash, that's 16 bytes per hash, or only about 1.6 MB of memory plus overhead.
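A rough sketch of that in PHP, reusing the array from your option 1 (the second argument to md5() asks for the raw 16-byte binary digest):

// Key the array by the 16-byte MD5 digest instead of the full category name.
$key = md5($category, true);   // true = raw binary output, 16 bytes
if (isset($allCategories[$key])) {
    $id = $allCategories[$key];
} else {
    mysql_query("INSERT INTO categories (procId, category)
                 VALUES ('$procId', '$category')");
    $id = mysql_insert_id();
    $allCategories[$key] = $id;
}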
Upvotes: 2
Reputation: 11625
One idea would be to add a unique constraint on the table so duplicate inserts are rejected by the database. Then just keep inserting all the records and let the DB do the checking.
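A rough sketch of that, assuming the id column is AUTO_INCREMENT and using a made-up name for the unique key:

// One-time schema change: let MySQL reject duplicate (procId, category) pairs.
mysql_query("ALTER TABLE categories
             ADD UNIQUE KEY uniq_proc_category (procId, category)");

// Per record: INSERT IGNORE silently skips rows that hit the unique key,
// so no array lookup or prior SELECT is needed.
mysql_query("INSERT IGNORE INTO categories (procId, category)
             VALUES ('$procId', '$category')");

If you also need the id back for every record, MySQL's documented INSERT ... ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id) form makes mysql_insert_id() return the existing row's id even when nothing new is inserted.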
Upvotes: 1