Anuj

Reputation: 333

Caching a fully dynamic website

I made a dynamic site that has over 20,000 pages, and once a page is created there is no need to update it for at least a month, or even a year. So I'm caching every page when it is first created and then delivering it from a static HTML file.

I'm running a php script (whole CMS is on PHP) if (file_exists($filename)) to first search for the filename from the url in cache-files directory and if it matches then deliver it otherwise generate the page and cache it for latter use. Though it is dynamic but still my url does not contain ?&=, I'm doing this by - and breaking it into array.

What I want to know is: will searching for a file in such a huge directory create any problems?

I saw a few Q&As like this one which say that there should not be a problem with the number of files I can store in a directory on an ext2 or ext3 file system (I guess my server has ext3), but that the speed of creating a new file drops rapidly once there are more than 20-30,000 files.

Currently I'm on a shared host and I must cache files. My host a soft limit of 100,000 files in my whole box which is good enough so far.

Can someone please give me a better idea of how to cache the site?

Upvotes: 2

Views: 362

Answers (2)

symcbean

Reputation: 48387

there should not be a problem with the number of files I can store in a directory with ext2 or ext3

That's rather an old document - there are 2 big differences between ext2 and ext3 - journalling is one, the other is H-TREE indexing of directories (which reduces the impact of storing lots of files in the same directory). While it's trivial to add journalling to an ext2 filesystem and mount it as ext3, this does not give the benefits of dir_index - this requires a full fsck.

Regardless of the filesystem, using a nested directory structure makes the system a lot more manageable and scalable - and avoids performance problems on older filesystems.

(I'm doing 3 other things since I started writing this and see someone else has suggested something similar - however Madara's approach doesn't give an evenly balanced tree, OTOH having a semantic path may be more desirable)

e.g.

define('GEN_BASE_PATH', '/var/data/cache-files');
define('GEN_LEVELS', 2);

// Map an id to an evenly distributed nested path,
// e.g. GEN_BASE_PATH/a/b/<rest-of-md5> when GEN_LEVELS is 2.
function gen_file_path($id)
{
    $key = md5($id);
    $fname = '';
    for ($x = 0; $x < GEN_LEVELS; $x++) {
        $fname .= substr($key, 0, 1) . "/";   // peel off one character per directory level
        $key = substr($key, 1);
    }
    return GEN_BASE_PATH . "/" . $fname . $key;
}
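
The only extra step on the write path is creating the intermediate directories; something along these lines (illustrative):

$path = gen_file_path('john-doe-for-presidency');
if (!is_dir(dirname($path))) {
    mkdir(dirname($path), 0777, true);   // create the nested levels recursively
}
file_put_contents($path, $html);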

However, the real way to solve the problem would be to serve the content with the right caching headers and run a caching reverse proxy in front of the webserver (though this isn't really practical for a very low volume site).
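
On the PHP side that means emitting headers along these lines before the content (values are illustrative) so the proxy is allowed to cache the page:

header('Cache-Control: public, max-age=2592000');   // cacheable for ~30 days
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', filemtime($filename)) . ' GMT');
echo $html;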

Upvotes: 0

Madara&#39;s Ghost
Madara&#39;s Ghost

Reputation: 175088

You shouldn't place all of the 20K files in a single directory.

Divide them into directories (by letter, for example), so you access:

a/apple-pie-recipe
j/john-doe-for-presidency

etc.

That would allow you to place more files with fewer constraints on the file system, which would increase the speed (since the FS doesn't need to locate your file among 20k others in one directory; it only has to look through the much smaller set under that letter).
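
A minimal sketch of that layout in PHP (the function name and base directory are just examples):

// Bucket cached pages by the first letter of the slug, e.g. "a/apple-pie-recipe".
function letter_bucket_path($slug)
{
    $letter = strtolower(substr($slug, 0, 1));
    return 'cache-files/' . $letter . '/' . $slug . '.html';
}

// letter_bucket_path('apple-pie-recipe') => 'cache-files/a/apple-pie-recipe.html'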

Upvotes: 4
