sceaj
sceaj

Reputation: 1653

Poor Performance of Mondrian w/ Degenerate Dimensions

I have an application that collects performance metrics and stores them in a datamart. I then use Mondrian to enable analysis and ad-hoc exploration of the data. I'm collecting about 5e6 rows per day and total size of the METRIC table is about 300M rows.

We "color" our data based on the metrics comparison to an SLA. There are exactly 5 distinct values for color. When we do simple MDX queries to get, for example, a color distribution of the data for a specific date range, say 1 day, we see queries like below:

2014-06-11 23:17:08,042 DEBUG [sql] - 223: SqlTupleReader.readTuples [[Color].[Color]]: executing sql [select "METRIC"."COLOR" as "c0" from "METRIC" "METRIC" group by "METRIC"."COLOR" order by "METRIC"."COLOR" ASC NULLS LAST] 2014-06-11 23:17:58,747 DEBUG [sql] - 223: , exec 50704 ms

In order to improve performance, the datamart includes aggregate tables at the hour and day levels, and both aggregate tables include the COLOR column.

I understand that Mondrian is very dependent on the underlying database performance, but there is really no way to tune this. I can create an index on COLOR (because a full scan of the index will be marginally faster than a full scan of the table), but it seems silly to create an index with 5 distinct value on a 300M row table. The day aggregate table has about 500K rows and would be significantly faster executing virtually the same query against this table, but Mondrian always seems to go to the base fact table for these dimension queries.

My question is, is there some way to avoid this query? If I can't avoid it, is it possible to get Mondrian to use the aggregate tables for this type of query? I have specified approxRowCount in the single level of this dimension/hierarchy and that eliminated the similar query to get the count of values. I haven't dug into the source of Mondrian yet to determine if there is a possibility of using the aggregate table or if there is some configuration on my part that is preventing it.

Edit for Clarification:

I probably didn't do a good job of asking my question-let me try and clarify. My MDX query looks something like:

select [Color].[Color].Members on columns,
       {[Measures].[Metric Value], [Measures].[Count]} on rows
from [Metric]
where [Time].[2014].[June].[11]

I can look at this and hand write a SQL query that answers this query

select COLOR, avg(VALUE), sum(FACT_COUNT) 
from AGG_DAY_METRIC
where YEAR = 2014 
  and MONTH = 6
  and DAY_OF_MONTH = 11
group by COLOR

The database answers this query in about 100ms scanning approx 4K rows. It takes Mondrian several minutes to answer the query because it does several queries that don't answer the MDX query directly, but rather get information about the dimension. In the case above, the database has to scan 300M rows, taking 50 seconds, to return that there are 5 possible colors. If color was in a normal dimension table there would only be 5 rows, but in a degenerate dimension there can be 100s of millions of rows.

So my questions are:

a) Is there a way to tell Mondrian the values of a degenerate dimension and avoid these queries?

b) Is there a way to have Mondrian answer these queries from aggregate tables?

Upvotes: 1

Views: 2452

Answers (3)

sceaj
sceaj

Reputation: 1653

This problem was solved, not by modifying anything in the Mondrian schema or the application, but the database. The database in this case was Oracle and we were able to create a materialized view with query rewrite enabled.

The materialized view is created from the exact query issued by Mondrian. Since the color values don't change very frequently (almost never in our case), the materialized view does a full refresh once a day.

In this case the queries went from taking minute(s) to milliseconds. If your facing an issue like this and your database is Oracle this is a good approach to speeding up the tuples resolution for degenerate dimensions with low cardinality.

Upvotes: 1

nsousa
nsousa

Reputation: 4544

If your dimension has 5 distinct values in a 300M fact table it should not be a degenerate dimension. It should be in a separate dimension table. A degenerate dimension should ONLY be used if its cardinality is close to the full fact table row count, making a separate table pointless, as there would be no significant storage savings and joining the dimension results in a lot of data being read;

If you put the colors on a separate dim table, any "Read Tuples" query will return results in a few ms, and your problem is solved.

However, more to the point of your question, Mondrian should be able to pick the dim values from the agg tables. Unless you have distinct-count aggregators in the cube, in which case you're in a tricky situation (unless there's an agg table that exactly matches the level of detail you need, Mondrian will very likely scan the fact table).

You should also set the highCardinality attribute of this degenerate dimension to True. Even with only 5 distinct values, having highCardinality=false tells Mondrian it's safe to scan the whole dimension to populate the list of members. Setting it to true stops this scan.

You should also add an index to this column. It's always a good idea to add indexes to every key and degenerate dimension column in a fact table. With an index the DB should answer much faster that SQL query.

Finally, you have a 300M row fact table. What DBMS are you using? Is it a Column oriented DB? If not, you should try them as a possible alternative to your data store. Column oriented DB have a significant performance increase over Row oriented DBs for Mondrian-like queries. There are a few good options out there, you should test drive them.

Upvotes: 0

claj
claj

Reputation: 5402

It's hard to give any specific directions without knowing more about your schema, but it looks to me you have to make sure that the number of rows with certain colours (count) has to be marked defined as an aggregate measure (Count or Max Number).

Please note that these aggregates are not calculated continuously (I think it would be to heavy for the backing data-store, and Mondrian won't keep a flowing set in memory for incoming facts).

The aggregation can be specified to be ran/rebuilt at specific times (nightly, hourly...). This would make Mondrian a bit unsuitable for real-time analysis, but you should be able to do almost instant queries on historical data.

Upvotes: 0

Related Questions