How does columnar database optimization differ from relational database optimization?

I have the following database structure, stored in a relational database:

Two fact tables with ~80 million rows each
Three dimension tables that have between 300,000 - 500,000 rows
Both fact tables have 3 foreign keys that are used to join to the dimension tables
One security table also has 3 foreign keys that are used to join to the dimension tables

A developer is using my data to create an application that utilizes a columnar database. They have been having issues with performance, and when I suggested adding indexes / keys to their tables, they said that indexing a columnar database does not improve performance. As a result, they are asking me to combine the fact tables with the dimension tables.

This seems to contradict what I know about the fundamental principles of database management. Is it true that columnar databases cannot use indexes to improve performance? What steps should be taken to optimize columnar performance?

I am seeking high-level information, but for the sake of completeness, the relational database is Teradata, and the columnar database is SAP HANA.

Upvotes: 6

Answers (3)

Lena Weber

Reputation: 272

At a high level, the difference between relational and columnar databases is in how the data is stored. Relational DB's store records by the row, columnar by the column.

For example: Records:

Name          ID number        zip code
smith         4444             98210
jones         1234             10125

An RDBMS stores this is blocks by record: smith, 4444, 98210 and jones, 1234, 10125 The columnar DB stores this in blocks by column: smith, jones and 4444, 1234 and 98210, 10125

You can create indexes. In HANA there are UNIQUE, BTREE, CPBTREE indexes. Unique indexes on a unique value - like primary keys in RDBMS, BTree is a binary search tree index and CPBTREE is the compressed prefix B+ tree index.

However, it is important to evaluate the performance problem before creating indexes hoping for a fix. Look at the logs, analyze the DB and find out what is causing the slow performance. The comment "a developer is using my data to create an application that uses a columnar database" is likely the crux of the problem. The way the data is stored and retrieved in each database type is completely different. RDBMS are better suited to transactional data. So if this app is taking advantage of a columnar database then it is better suited to efficiently searching for specific data in large amounts of data -- as only the impacted columns need to be loaded, not the entire record.

This application may not run correctly simply due to the different DB structure.

Upvotes: 4

Lars Br.

Reputation: 10396

The general statement that indexes don’t provide an option for better performance in SAP HANA is not correct. There are clear cases for when an index can improve data access by orders of magnitudes.

As usual with database performance, there is more information needed than just “there are problems” to find the cause of slow performance. SAP HANA provides some specific development artefacts (Analytical Views and Calculation Views with Star Join) to support FACT-DIMENSION model queries. If these have been used, then reviewing the execution plan for the slow queries would be the next step.

Should that not lead to an approach to improve the performance, then using the PlanViz execution trace would be the next best bet. This allows to review which part of the query execution actually takes how much time.

That’s as far as high-level statements can take you here. Anything beyond this will require to look at the mentioned information and the query in question.

Upvotes: 0

Cristian Varela

Reputation: 3

I'm not that familiar with SAP HANA but in general Columnstore databases don't have indexes in the traditional relational sense of the word. Instead every column is like an individual index.

This type of DBs usually work well for analytical queries since typically they read a lot of data. Take for example any fact table where one of the Foreign Keys to a dimension traditionally would have a lot of repeated values (is assumed that the dimension would be much smaller in terms of rows than the fact tables).

If the rows are inserted in the fact table ordered by (among other) this column you can potentially achieve an excellent level of compression in the table and therefore it will require a lot less I/O from disk to read the table.

ie: col_fk_to_dim = [1,1,1,1,1,2,2,2,3,3,3,3,3,3,4,5,5,5,5,5 ...]

could be compressed to [1x5, 2x3, 3x6, 4x1,5x5, ...]

additionally if the system is distributed over few nodes you need to take into consideration the distribution key to ensure every node has a similar share of the data to process.

If you're having performance issues the first thing I would check is the queries you're launching against the tables. Next check the columns they're being joined on and see if the fact table is being filled in sort order by those columns.

From there you can troubleshoot further.

Upvotes: 0

How does columnar database optimization differ from relational database optimization?

Answers (3)

Related Questions