Reputation: 1167
I am trying to come up with a theoretical solution to an NxN problem for data aggregation and storage. As an example I have a huge amount of data that comes in via a stream. The stream sends the data in points. Each point has 5 dimensions:
- Location
- Date
- Time
- Name
- Statistics
This data then needs to be aggregated and stored to allow another user to come along and query the data for both location and time. The user should be able to query like the following (pseudo-code):
Show me aggregated statistics for Location 1,2,3,4,....N between Dates 01/01/2011 and 01/03/2011 between times 11am and 4pm
Unfortunately, due to the scale of the data, it is not possible to aggregate the points on the fly, so the data needs to be aggregated ahead of time. As you can see, though, there are multiple dimensions that the data could be aggregated on.
They can query for any number of days or locations and so finding all the combinations would require huge pre-aggregation:
- Record for Locations 1 Today
- Record for Locations 1,2 Today
- Record for Locations 1,3 Today
- Record for Locations 1,2,3 Today
- etc... up to N
Preprocessing all of these combinations prior to querying could result in an amount of processing that is not viable. If we have 200 different locations then we have 2^200 combinations, which would be nearly impossible to precompute in any reasonable amount of time.
I did think about creating records on one dimension and merging them on the fly when requested, but this would also take time at scale.
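To put those numbers in perspective, here is a quick back-of-the-envelope check in Python. The one-record-per-location-per-hour layout is only an assumption used for comparison, not something fixed by the problem:

```python
# Compare "one record per subset of locations" with "one record per location".
n_locations = 200

# Every non-empty subset of locations would need its own pre-aggregated record.
subset_count = 2 ** n_locations - 1

# Assumption for comparison: one record per location per hour, for one year.
hours_per_year = 24 * 365
per_location_records = n_locations * hours_per_year

print(f"subsets of locations: {subset_count:.3e}")
print(f"per-location hourly records per year: {per_location_records:,}")
```

The first number has about 60 digits, while the second is under two million, which is why only per-location records look feasible to precompute.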
Questions:
Thank you for your time.
EDIT 1
When I say aggregating the data together I mean combining the statistics and name (dimensions 4 & 5) for the other dimensions. So for example if I request data for Locations 1,2,3,4..N then I must merge the statistics and counts of name together for those N Locations before serving it up to the user.
Similarly, if I request the data for dates 01/01/2015 - 01/12/2015 then I must aggregate all data between those dates (by summing name/statistics).
Finally, if I ask for data between dates 01/01/2015 - 01/12/2015 for Locations 1,2,3,4..N then I must aggregate all data between those dates for all those locations.
For the sake of this example, let's say that going through statistics requires some sort of nested loop and does not scale well, especially on the fly.
Upvotes: 5
Views: 652
Reputation: 7284
Denormalization is a means of addressing performance or scalability in a relational database.
IMO having some new tables to hold aggregated data and using them for reporting will help you.
I have a huge amount of data that comes in via a stream. The stream sends the data in points.
There are multiple ways to achieve denormalization in this case:
In an ideal scenario, when a message reaches the streaming level, two copies of the data message (containing the location, date, time, name, and statistics dimensions) are dispatched for processing: one goes to OLTP (the current application logic) and the second goes to an OLAP (BI) process.
The BI process will create denormalized aggregated structures for reporting.
I would suggest having one aggregated data record per location and date group.
The end user will then query preprocessed data that won't need heavy recalculation, at the cost of some acceptable inaccuracy.
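As a rough illustration of one record per location and date, here is a minimal Python sketch. It assumes the statistic is a plain number that can be summed, and the names `daily`, `ingest`, and `query` are made up for the example; a real system would keep these records in warehouse tables rather than in a dictionary:

```python
from collections import defaultdict
from datetime import date

# Hypothetical pre-aggregated store: (location, day) -> running totals.
daily = defaultdict(lambda: {"count": 0, "total": 0.0})

def ingest(location, day, value):
    """Fold one incoming point into its (location, day) record."""
    rec = daily[(location, day)]
    rec["count"] += 1
    rec["total"] += value

def query(locations, start, end):
    """Merge the pre-aggregated records for the given locations and date range."""
    count, total = 0, 0.0
    for (loc, day), rec in daily.items():
        if loc in locations and start <= day <= end:
            count += rec["count"]
            total += rec["total"]
    return {"count": count, "total": total}

ingest(1, date(2011, 1, 1), 10.0)
ingest(1, date(2011, 1, 1), 5.0)
ingest(2, date(2011, 1, 2), 7.0)
ingest(3, date(2011, 2, 1), 100.0)

result = query({1, 2}, date(2011, 1, 1), date(2011, 1, 3))
print(result)
```

The query touches one record per location per day instead of every raw point. If time-of-day filters are needed, the same idea works with an hourly key instead of a daily one.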
How should I go about choosing the right dimension and/or combination of dimensions given that the user is as likely to query on all dimensions?
That will depend on your application logic. If possible, limit the user to predefined queries whose values the user can fill in (like dates from 01/01/2015 to 01/12/2015). In more complex systems, using a report generator on top of the BI warehouse is an option.
I'd recommend Kimball's The Data Warehouse ETL Toolkit.
Upvotes: 1
Reputation: 10064
You can at least reduce Date and Time to a single dimension, and pre-aggregate your data based on your minimum granularity, e.g. 1-second or 1-minute resolution. It could be useful to cache and chunk your incoming stream for the same resolution, e.g. append totals to the datastore every second instead of updating for every point.
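A minimal sketch of that buffering idea in Python, assuming a numeric value per point and a 1-minute resolution. The `MinuteBuffer` class and its in-memory `store` are stand-ins invented for the example, not a real datastore API:

```python
from collections import defaultdict
from datetime import datetime

def minute_bucket(ts):
    """Truncate a timestamp to its 1-minute bucket (date and time as one dimension)."""
    return ts.replace(second=0, microsecond=0)

class MinuteBuffer:
    """Accumulates points in memory and writes one total per (location, minute)."""

    def __init__(self):
        self.pending = defaultdict(float)  # (location, minute) -> partial sum
        self.store = defaultdict(float)    # stand-in for the real datastore

    def add(self, location, ts, value):
        self.pending[(location, minute_bucket(ts))] += value

    def flush(self):
        """Append the buffered totals to the store, then clear the buffer."""
        for key, total in self.pending.items():
            self.store[key] += total
        self.pending.clear()

buf = MinuteBuffer()
buf.add("L1", datetime(2011, 1, 1, 11, 0, 5), 2.0)
buf.add("L1", datetime(2011, 1, 1, 11, 0, 40), 3.0)
buf.add("L1", datetime(2011, 1, 1, 11, 1, 10), 4.0)
buf.flush()
print(dict(buf.store))
```

Three incoming points become two writes here; at real stream rates the reduction would be much larger.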
What's the size and likelihood of change of the name and location domains? Is there any relation between them? You said there could be as many as 200 locations. I'm thinking that if name is a very small set and unlikely to change, you could hold counts of names in per-name columns in a single record, reducing the scale of the table to 1 row per location per unit of time.
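To make the per-name columns concrete, here is a small Python sketch. The name set `NAMES` and the string time buckets are hypothetical, chosen only to keep the example short:

```python
from collections import Counter, defaultdict

# Assumption: the set of names is small and fixed, so each one acts as a "column".
NAMES = ("alice", "bob", "carol")  # hypothetical name domain

# One row per (location, time bucket); each row holds a count per name.
rows = defaultdict(Counter)

def record(location, bucket, name):
    """Increment the count for one name in the row for (location, bucket)."""
    if name not in NAMES:
        raise ValueError(f"unknown name: {name}")
    rows[(location, bucket)][name] += 1

def merge(keys):
    """Sum the per-name counts of several rows into one Counter."""
    merged = Counter()
    for key in keys:
        merged.update(rows.get(key, Counter()))
    return merged

record(1, "2011-01-01T11:00", "alice")
record(1, "2011-01-01T11:00", "alice")
record(2, "2011-01-01T11:00", "bob")

totals = merge([(1, "2011-01-01T11:00"), (2, "2011-01-01T11:00")])
print(totals)
```

Merging any set of locations is then just a column-wise sum over the selected rows.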
Upvotes: 1
Reputation: 1723
Is there really likely to be a way of doing this without brute forcing it in some way?
I'm only familiar with relational databases, and I think that the only real way to tackle this is with a flat table as suggested before i.e. all your datapoints as fields in a single table. I guess that you just have to decide how to do this, and how to optimize it.
Unless you have to maintain 100% accuracy down to the single record, I think the question really needs to be: what can we throw away?
I think my approach would be to:
Obviously I'm betting that quantising the time domain in this way is acceptable. You could supply interactive drill-down by querying back onto the raw data by time domain too, but that would still be slow.
Hope this helps.
Mark
Upvotes: 0
Reputation: 4138
I have worked with a point-of-sale database with a hundred thousand products and ten thousand stores (typically week-level aggregated sales, but also receipt-level data for basket analysis, cross sales etc.). I would suggest you have a look at these:
In my experiments Elasticsearch was 20 - 50% faster than Microsoft's column store or clustered index tables for small and medium-size queries on the same hardware. To get fast response times you must have enough RAM to keep the necessary data structures loaded in memory.
I know I'm missing many other DB engines and platforms, but these are the ones I am most familiar with. I have also used Apache Spark, though not for data aggregation, only for distributed mathematical model training.
Upvotes: 0
Reputation: 164
You should check out Apache Flume and Hadoop http://hortonworks.com/hadoop/flume/#tutorials
The Flume agent can be used to capture and aggregate the data into HDFS, and you can scale this as needed. Once the data is in HDFS there are many options to visualize it, and you can use MapReduce or Elasticsearch to view the data sets you describe in your examples.
Upvotes: 0
Reputation: 3018
From your description it seems that your data is a time-series dataset. The user seems to be mostly concerned about the time when querying and after selecting a time frame, the user will refine the results by additional conditions.
With this in mind, I suggest you try a time-series database like InfluxDB or OpenTSDB. For example, Influx provides a query language that is capable of handling queries like the following, which comes quite close to what you are trying to achieve:
SELECT count(location) FROM events
WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
GROUP BY time(10m);
I am not sure what you mean by scale, but time-series DBs have been designed to be fast for lots of data points. I'd definitely suggest giving them a try before rolling your own solution!
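For comparison, this is roughly what the query above computes, written as a plain Python sketch. It is not how InfluxDB works internally; `window_start` and `count_by_window` are illustrative names, and the window size is assumed to divide 60 evenly:

```python
from collections import Counter
from datetime import datetime, timedelta

def window_start(ts, minutes=10):
    """Round a timestamp down to the start of its N-minute window."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

def count_by_window(events, start, end, minutes=10):
    """Count events with start < time < end, grouped by time window."""
    counts = Counter()
    for ts, _location in events:
        if start < ts < end:
            counts[window_start(ts, minutes)] += 1
    return counts

events = [
    (datetime(2013, 8, 12, 22, 35, 0), "London"),
    (datetime(2013, 8, 12, 22, 38, 30), "Paris"),
    (datetime(2013, 8, 12, 22, 41, 0), "London"),
    (datetime(2013, 8, 13, 0, 5, 0), "London"),  # outside the queried range
]
counts = count_by_window(events,
                         datetime(2013, 8, 12, 22, 32, 1, 232000),
                         datetime(2013, 8, 13))
print(counts)
```

A time-series database does this kind of bucketing for you and keeps the data organized by time, so the range filter does not need to scan every point.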
Upvotes: 2
Reputation: 252
You have a lot of data. Any method will take a long time because of the amount of data you're trying to parse. I have two methods to offer. The first is a brute-force one that you have probably already thought of:
id | location | date | time | name | statistics
0 | blablabl | blab | blbl | blab | blablablab
1 | blablabl | blab | blbl | blab | blablablab
etc.
With this one you can easily parse and get elements, since they are all in the same table, but the parsing is slow and the table is enormous.
Second one is better I think:
Multiple tables:
id | location
0 | blablabl
id | date
0 | blab
id | time
0 | blab
id | name
0 | blab
id | statistics
0 | blablablab
With this you could parse (a lot) faster: get the IDs first, then fetch all the information you need. It also allows you to pre-sort all the data: you can keep the locations sorted by location, the times sorted by time, the names sorted alphabetically, etc., because we don't care how the IDs are ordered. Whether the IDs come out as 1 2 3 or 1 3 2, no one actually cares, and parsing goes a lot faster if your data is already sorted in its respective tables.
So, if you use the second method: at the moment you receive a point of data, give the same ID to each of its columns:
You receive:
London 12/12/12 02:23:32 donut verygoodstatsblablabla
You attach the ID to each part and file each part in its respective table:
42 | London ==> goes with London location in the location table
42 | 12/12/12 ==> goes with 12/12/12 dates in the date table
42 | ...
With this, if you want all the London data, the entries are all side by side: you just take all their IDs and fetch the other data with them. If you want all the data between 11/11/11 and 12/12/12, those entries are side by side too, and again you just take the IDs, etc.
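Here is a small Python sketch of this second method, under the assumption that each per-dimension table is a sorted list of (value, ID) pairs searched with binary search. The `DimensionIndex` class and the sample points are made up for illustration:

```python
from bisect import bisect_left, bisect_right
from datetime import date

class DimensionIndex:
    """One 'table' per dimension: (value, point id) pairs kept sorted by value."""

    def __init__(self):
        self.entries = []
        self.values = []

    def add(self, value, point_id):
        self.entries.append((value, point_id))

    def finish(self):
        """Sort once after loading so equal or nearby values sit side by side."""
        self.entries.sort()
        self.values = [v for v, _ in self.entries]

    def ids_between(self, low, high):
        """Ids whose value lies in [low, high], found by binary search."""
        lo = bisect_left(self.values, low)
        hi = bisect_right(self.values, high)
        return {pid for _, pid in self.entries[lo:hi]}

locations, dates = DimensionIndex(), DimensionIndex()
points = [
    (41, "Paris", date(2011, 11, 20)),
    (42, "London", date(2013, 1, 5)),
    (43, "London", date(2011, 12, 1)),
]
for pid, loc, day in points:
    locations.add(loc, pid)
    dates.add(day, pid)
locations.finish()
dates.finish()

london = locations.ids_between("London", "London")
in_range = dates.ids_between(date(2011, 11, 11), date(2012, 12, 12))
matching = london & in_range
print(matching)
```

Each dimension gives back a set of IDs, and combining conditions is a set intersection; the statistics for the matching IDs are then fetched and aggregated.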
Hope I helped, sorry for my poor english.
Upvotes: 0