Rams

Reputation: 2189

MySQL - Index optimization

We have an Analytics product. For each of our customers we provide a JavaScript snippet that they embed in their web sites. Whenever a user visits a customer's site, the JavaScript code hits our server so that we can store the page view on behalf of that customer. Each customer has a unique domain name, i.e. a customer is identified by their domain name.

Database server: MySQL 5.6
Table rows: 400 million

Following is our table schema.

+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| id            | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| domain        | varchar(50)      | NO   | MUL | NULL    |                |
| guid          | binary(16)       | YES  |     | NULL    |                |
| sid           | binary(16)       | YES  |     | NULL    |                |
| url           | varchar(2500)    | YES  |     | NULL    |                |
| ip            | varbinary(16)    | YES  |     | NULL    |                |
| is_new        | tinyint(1)       | YES  |     | NULL    |                |
| ref           | varchar(2500)    | YES  |     | NULL    |                |
| user_agent    | varchar(255)     | YES  |     | NULL    |                |
| stats_time    | datetime         | YES  |     | NULL    |                |
| country       | char(2)          | YES  |     | NULL    |                |
| region        | char(3)          | YES  |     | NULL    |                |
| city          | varchar(80)      | YES  |     | NULL    |                |
| city_lat_long | varchar(50)      | YES  |     | NULL    |                |
| email         | varchar(100)     | YES  |     | NULL    |                |
+---------------+------------------+------+-----+---------+----------------+

In the above table, guid identifies a visitor to a customer's site and sid identifies a visitor session on that site. That means every sid has an associated guid.

We need queries like following

Query 1: Find unique and total visitors

SELECT COUNT(DISTINCT guid) AS count, COUNT(guid) AS total FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-05 00:00:00' AND '2015-10-05 23:59:59'

composite index planning : domain, stats_time, guid
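For reference, that index could be created like this (the index name is just illustrative); with guid as the last column the index covers the query, so it can be answered from the index alone without touching the table rows:

    -- covering index for Query 1
    ALTER TABLE page_views
      ADD INDEX idx_domain_time_guid (domain, stats_time, guid);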

Query 2: Find unique and total sessions

SELECT COUNT(DISTINCT sid) AS count, COUNT(sid) AS total FROM page_views WHERE domain = 'abc' AND stats_time BETWEEN '2015-10-05 00:00:00' AND '2015-10-05 23:59:59'

composite index planning : domain, stats_time, sid
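The equivalent covering index for Query 2 (again, the name is illustrative):

    -- covering index for Query 2
    ALTER TABLE page_views
      ADD INDEX idx_domain_time_sid (domain, stats_time, sid);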

Query 3: Find visitors and sessions by country, by region, and by city

composite index planning : domain, country

composite index planning : domain, region
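As a sketch, those two indexes would look like this (names illustrative):

    ALTER TABLE page_views
      ADD INDEX idx_domain_country (domain, country),
      ADD INDEX idx_domain_region  (domain, region);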

Each combination requires a new composite index. That means a huge index file that we can't keep in memory, so the performance of the queries is low.

Is there any way to optimize these index combinations to reduce the index size and improve performance?

Upvotes: 1

Views: 67

Answers (1)

DRapp

Reputation: 48179

Just for grins, run this to see what type of spread you have...

-- how many rows does each country/region/city/day bucket hold?
select
      country, region, city,
      DATE_FORMAT(stats_time, '%Y-%m-%d') AS DATEONLY,
      count(*) AS cnt
   from
      page_views
   group by
      country, region, city,
      DATE_FORMAT(stats_time, '%Y-%m-%d')
   order by
      cnt desc

and then see how many rows it returns, and what sort of range the COUNT column generates. Instead of just an index, does it make sense to create a separate aggregation table covering the key elements you are mining?

If so, I would recommend looking at a similar post also on the stack here. It shows a SAMPLE of how, but I would first look at the counts before suggesting further. If you have it broken down on a daily basis, think about what this MIGHT be reduced to.

Additionally, you might want to build the pre-aggregate tables ONCE to get started, then have a nightly procedure that adds new records for the day just completed. That way it never has to run through all 400M records.
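A minimal sketch of what that might look like; the table and column names here are assumptions, not anything fixed:

    -- hypothetical daily roll-up: one row per domain/day/country/region/city
    -- (PRIMARY KEY columns must be NOT NULL, so NULLs are mapped to '')
    CREATE TABLE page_views_daily (
      domain     varchar(50)  NOT NULL,
      stats_date date         NOT NULL,
      country    char(2)      NOT NULL DEFAULT '',
      region     char(3)      NOT NULL DEFAULT '',
      city       varchar(80)  NOT NULL DEFAULT '',
      page_views int unsigned NOT NULL,
      visitors   int unsigned NOT NULL,  -- COUNT(DISTINCT guid) for that day
      sessions   int unsigned NOT NULL,  -- COUNT(DISTINCT sid) for that day
      PRIMARY KEY (domain, stats_date, country, region, city)
    );

    -- nightly job: aggregate only the day just completed,
    -- never the full 400M-row table
    INSERT INTO page_views_daily
    SELECT domain,
           DATE(stats_time),
           COALESCE(country, ''),
           COALESCE(region, ''),
           COALESCE(city, ''),
           COUNT(*),
           COUNT(DISTINCT guid),
           COUNT(DISTINCT sid)
      FROM page_views
     WHERE stats_time >= CURDATE() - INTERVAL 1 DAY
       AND stats_time <  CURDATE()
     GROUP BY domain, DATE(stats_time),
              COALESCE(country, ''), COALESCE(region, ''), COALESCE(city, '');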

If your pre-aggregate tables store data based on just the date (y, m, d only), your queries rolled up per day would shorten querying requirements. The COUNT(*) is just an example basis; you could add COUNT(DISTINCT whateverColumn) as needed. Then you could query SUM(aggregateColumn) by domain, date range, etc. If your 400M records get reduced down to 7M records, I would also add a minimal index on (domain, dateOnlyField, and maybe country) to optimize your domain / date-range queries. Once you have something narrowed down at whatever level makes sense, you can always drill into the raw data for the granular level.
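For example, a sketch of what the roll-up version of Query 1 could become, assuming the hypothetical page_views_daily table above:

    SELECT SUM(visitors)   AS visitors,
           SUM(page_views) AS total
      FROM page_views_daily
     WHERE domain = 'abc'
       AND stats_date BETWEEN '2015-10-01' AND '2015-10-05';

One caveat: summing per-day distinct counts over-counts visitors who return on more than one day, so for an exact distinct count across a multi-day range you would still drill into the raw data, as noted above.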

Upvotes: 2
