Reputation: 472
For a simple web application, the main requirement is to process around 30 million records (10 million rows × 3 tables) as fast as possible. I haven't worked with this amount of data before, so I would like some suggestions/advice from experienced people.
The database will hold details of businesses. Around 25 attributes will describe a single business: name, address, etc. The table structure is as follows.
CREATE TABLE IF NOT EXISTS `businesses` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `type` int(2) NOT NULL,
  `organisation` varchar(40) NOT NULL,
  `title` varchar(12) NOT NULL,
  `given_name` varchar(40) NOT NULL,
  `other_name` varchar(40) NOT NULL,
  `family_name` varchar(40) NOT NULL,
  `suffix` varchar(5) NOT NULL,
  `reg_date` date NOT NULL,
  `main_trade_name` varchar(150) NOT NULL,
  `son_address_l1` varchar(50) NOT NULL,
  `son_address_l2` varchar(50) NOT NULL,
  `son_address_suburb` int(3) NOT NULL,
  `son_address_state` int(2) NOT NULL,
  `son_address_postcode` varchar(10) NOT NULL,
  `son_address_country` int(3) NOT NULL,
  `bus_address_l1` varchar(50) NOT NULL,
  `bus_address_l2` varchar(50) NOT NULL,
  `bus_address_suburb` int(3) NOT NULL,
  `bus_address_state` int(2) NOT NULL,
  `bus_address_postcode` varchar(10) NOT NULL,
  `bus_address_country` int(3) NOT NULL,
  `email` varchar(165) DEFAULT NULL,
  `phone` varchar(12) NOT NULL,
  `website` varchar(80) NOT NULL,
  `employee_size` int(4) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `type` (`type`),
  KEY `phone` (`phone`),
  KEY `reg_date` (`reg_date`),
  KEY `son_address_state` (`son_address_state`),
  KEY `bus_address_state` (`bus_address_state`),
  KEY `son_address_country` (`son_address_country`),
  KEY `bus_address_country` (`bus_address_country`),
  FULLTEXT KEY `title` (`title`),
  FULLTEXT KEY `son_address_l1` (`son_address_l1`),
  FULLTEXT KEY `son_address_l2` (`son_address_l2`),
  FULLTEXT KEY `bus_address_l1` (`bus_address_l1`),
  FULLTEXT KEY `bus_address_l2` (`bus_address_l2`)
) ENGINE=MyISAM;
There are going to be two other tables with the same structure, because each business's details will be present in three sources (for comparison purposes). Only one of the tables will receive writes.
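For illustration only, a cross-source comparison could look like the query below; `businesses_src2` and `businesses_src3` are assumed names for the two read-only copies, since the question doesn't name them.

-- Hypothetical comparison of one business across the three source tables,
-- joined on the shared primary key.
SELECT a.id,
       a.main_trade_name AS src1_name,
       b.main_trade_name AS src2_name,
       c.main_trade_name AS src3_name
FROM businesses AS a
JOIN businesses_src2 AS b ON b.id = a.id
JOIN businesses_src3 AS c ON c.id = a.id
WHERE a.id = 12345;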
About the app usage:
My questions are:
Thank you.
Upvotes: 4
Views: 2498
Reputation: 65496
I cannot answer your direct question, but I do have experience working with large datasets.
The first thing I would work out is what the majority use case is (in your case, search) and what operations it involves, and then consider data storage/partitioning based on that.
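As an illustrative sketch only (not a recommendation for this schema), if measuring showed that most searches filter on `bus_address_state`, a MySQL hash-partitioned layout might look like the following; the table name and partition count are made up, and the column list is trimmed.

-- Sketch: partition by the column the majority of searches filter on
-- (assumed here to be bus_address_state). MySQL requires the partitioning
-- column in every unique key, hence the composite primary key, and
-- partitioned tables cannot carry FULLTEXT indexes, so those are omitted.
CREATE TABLE IF NOT EXISTS `businesses_by_state` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `organisation` varchar(40) NOT NULL,
  `main_trade_name` varchar(150) NOT NULL,
  `bus_address_state` int(2) NOT NULL,
  PRIMARY KEY (`id`, `bus_address_state`)
) ENGINE=MyISAM
PARTITION BY HASH (`bus_address_state`)
PARTITIONS 8;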
Next, measure, measure, and measure again. Some database systems work well with one kind of operation and poorly with others, and as the amount of data and the operational complexity grow, things that worked well may start to degrade. This is why you measure: don't try to design this without good evidence of how the database systems you're using behave under these loads.
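A minimal way to gather that evidence in MySQL is to prefix each candidate query with EXPLAIN and compare the plans (the filter values below are arbitrary examples):

-- Check whether a candidate search actually uses the intended indexes.
EXPLAIN
SELECT id, organisation, main_trade_name
FROM businesses
WHERE bus_address_state = 5
  AND reg_date >= '2010-01-01';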
Then work iteratively to add more operations.
Don't try to design a best fit for everything. As your design and research are distilled, you'll see places where optimisations may be needed or available. You may also find, as we have in the past, that different types of caching and indexing may be needed at different times.
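For instance, a sketch of such a later indexing change, assuming measurements showed that address searches usually span both address lines: the per-column FULLTEXT keys could be merged into combined ones.

-- Sketch: replace the per-column FULLTEXT keys with a combined one
-- (MyISAM, as in the question's schema).
ALTER TABLE businesses
  DROP INDEX son_address_l1,
  DROP INDEX son_address_l2,
  ADD FULLTEXT KEY son_address (son_address_l1, son_address_l2);

-- MATCH must then name the same column list as the combined index.
SELECT id, main_trade_name
FROM businesses
WHERE MATCH (son_address_l1, son_address_l2) AGAINST ('george street');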
Good luck - sounds like an interesting project.
Upvotes: 5