user1441871

Reputation: 31

Using blob or not, performance issues

First of all, I am not a database expert but a contractor. I hired a (good) programmer, but now I have some doubts about a certain part of the database design, due to some problems we are experiencing and all the information I have been reading. Let's start.

We built a housing site that uses a parser to process all the data and store it in an MS SQL database. Every day the feeds contain about 70,000 records, most of which also have pictures attached (three on average). The pictures vary in size from 30 KB to 400 KB. The database holds about the same number of records, and there are about 400 new objects to be processed each day. This means that every day, every record in the feed has to be compared against the database to see whether its data has changed, whether an object has been deleted, or whether it is a new object and therefore has to be inserted. The pictures are stored in the database. The feeds are processed on a dual quad-core machine with 32 GB of memory and SSA disks. The database is now 600 GB in size.

Currently we have about 3,000 users a day, each of whom looks at 6 houses and views 10 images on average.

This is what we experience:

  - The whole process of parsing takes about 13 hours.
  - We get a lot of timeout errors in the log.
  - We get quite a few deadlock errors.
  - Google complains about the timeout errors and, as a result, indexes few pages.
  - Google rates the site as slow because some directories take more than 10 seconds to load.

I personally think it has something to do with the pictures in the database and some bad queries. But before I start complaining to my programmer, I would like to hear your opinion on this. Thanks in advance for your time.

Update from my programmer: Here is some information on the structure of the tables. There are two tables for the images: one called imageinfo, which is used for querying on the images (for instance, getting a list of image IDs and content types), and an image table, which contains the ID of the image and the BLOB. The imageinfo table has the same ID as the image table (a 1:1 relationship) plus some extra information such as the name, the type, and a hash of the image. That hash is used in the parser process to determine whether an image has changed. So the image table is only touched when the parser does an insert/update/delete, or when the site accesses an image. The time it takes to access and download one image is around ~350 ms.

Upvotes: 3

Views: 5003

Answers (2)

usr

Reputation: 171178

You tell us two problems:

  1. Import is slow
  2. Browsing the site is slow

(2) is easy: you probably need to understand your read queries and index for them. This is definitely solvable.

(1) is much harder to say anything about without more specifics. I understand you need to compare lots of blobs; you could store a compact hash of those blobs alongside the actual data. That way you don't need to retrieve the blob for comparison purposes, and you can even index the hash.
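The hash idea can be sketched in a few lines. A minimal Python example (the function name is hypothetical): compute a compact SHA-256 digest of the image's bytes once, store it next to the blob, and from then on compare only digests.

```python
import hashlib


def image_digest(data: bytes) -> str:
    """Return a compact, indexable fingerprint of an image's raw bytes."""
    return hashlib.sha256(data).hexdigest()


# Identical bytes always produce the same digest; any change alters it.
unchanged_a = image_digest(b"...jpeg bytes...")
unchanged_b = image_digest(b"...jpeg bytes...")
edited = image_digest(b"...jpeg bytes, edited...")
```

The digest is a fixed 64-character string regardless of image size, so comparing it is cheap and it fits easily into an indexed column.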

Should you have images in the database?

The biggest pros are consistent and easy backups, and developer convenience. The biggest con is potential misuse. You really can't say in general that images belong in the file system. The database is usually fine for them, unless there are specific, concrete reasons to put them somewhere else.

My guess is that your usage of those blobs falls under misuse, and that you would have the same problems if the files were stored in the file system.

Upvotes: 3

Matt Whitfield

Reputation: 6574

You really need to measure where the performance is hurting you. Without knowing what exactly is slow, you can't hope to start fixing it.

However, if you're looking for ideas on where to start measuring, then I would say look at the import process and see what it is doing in RBAR style. RBAR stands for 'Row By Agonising Row' and aptly describes processes which operate on one row at a time when they would be far more efficient working in sets.
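As an illustration of the RBAR-versus-set difference (using SQLite here only because it ships with Python; the table and data are made up, but the same principle applies to an MS SQL import):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (id INTEGER PRIMARY KEY, price INTEGER)")

rows = [(i, 100_000 + i) for i in range(1000)]

# RBAR: one statement per row -- at 70,000 records a day, the per-row
# round-trip overhead dominates the import time.
# for r in rows:
#     conn.execute("INSERT INTO houses VALUES (?, ?)", r)

# Set-based: hand the driver the whole set in one call and one transaction.
conn.executemany("INSERT INTO houses VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM houses").fetchone()[0]
```

On the server side the same idea applies: one UPDATE or MERGE over a staged set of rows instead of a loop issuing one statement per record.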

Another thing I would check is whether you're actually comparing the content of each image to make sure it hasn't changed. If you're doing a binary comparison of that data, I can imagine it would be intensely slow. If you compute a checksum for it and compare the checksums instead, then:

a) You can compute that checksum outside of the SQL Server process, preferably on another box.
b) You will be able to check for updated images in a much more lean process, particularly if that checksum is an INCLUDE column on a suitable index.
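Once checksums exist on both sides, the parser can classify every image with plain set operations instead of touching any blob. A hedged sketch (the dictionaries stand in for "checksums already in the DB" and "checksums from today's feed"; keys and values are invented):

```python
# Hypothetical data: image ID -> checksum, for the DB and for today's feed.
db_hashes = {"img1": "aaa", "img2": "bbb", "img3": "ccc"}
feed_hashes = {"img1": "aaa", "img2": "ddd", "img4": "eee"}

# Set arithmetic on the IDs and checksums does all the classification work.
changed = {k for k in feed_hashes if k in db_hashes and db_hashes[k] != feed_hashes[k]}
new = feed_hashes.keys() - db_hashes.keys()
deleted = db_hashes.keys() - feed_hashes.keys()
unchanged = {k for k in feed_hashes if db_hashes.get(k) == feed_hashes[k]}
```

Only the `changed` and `new` sets ever require writing to the blob table; the unchanged majority of the 70,000 daily records never touches it.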

But, as was commented, storing the images in the DB isn't the brightest of ideas anyway.

Upvotes: 0
