Jacky

Reputation: 750

How to efficiently query a large database on an hourly basis?

Background:

I have multiple asset tables stored in a Redshift database, one per city, 8 cities in total. These asset tables record status updates on an hourly basis: 8 SQL tables and about 500 million rows of data in a year. (I also have access to the server that updates this data every minute.)

Example: One market can have 20k assets, producing 480k (20k * 24 hrs) status updates a day.

These status updates are in a raw format and need to undergo a transformation process that is currently written as a SQL view. The end state goes into our BI tool (Tableau) for external stakeholders to look at.

Problem:

The current way the data is processed is slow and inefficient, and it is probably not realistic to run this job on an hourly basis in Tableau. The status transformation requires a 30-day lookback, so the query does need to scan that history every time it runs.
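For context, a minimal sketch of what such a view might look like; the table name (asset_status_raw), the columns, and the carry-forward logic are all assumptions, since the actual transformation isn't shown here:

    -- Hypothetical sketch; table, columns, and logic are assumptions
    CREATE OR REPLACE VIEW asset_status_transformed AS
    SELECT
        asset_id,
        updated_at,
        -- example transformation: carry the latest non-null status forward
        LAST_VALUE(status IGNORE NULLS) OVER (
            PARTITION BY asset_id
            ORDER BY updated_at
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS effective_status
    FROM asset_status_raw
    WHERE updated_at >= DATEADD(day, -30, GETDATE());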

Possible Solutions:

Here are some solutions that I think might work; I would like feedback on which makes the most sense in my situation:

  1. Summary tables updated hourly.
  2. Materialized views refreshed hourly.
  3. Moving the transformed data into external storage (a Tableau data source).

Please let me know how you would approach this problem. My knowledge is in SQL, Tableau (Prep & Desktop), and scripting in Python or R; I have limited data engineering experience.

Upvotes: 1

Views: 361

Answers (1)

Bill Weiner

Reputation: 11032

So first things first: you say that the data processing is "slow and inefficient" and ask how to efficiently query a large database, so I'd start by looking at how to improve this process. You indicate that the process is based on the past 30 days of data. Are the large tables time-sorted, vacuumed, and analyzed? It is important to take maximum advantage of metadata when working with large tables. Make sure your WHERE clauses are effective at eliminating fact-table blocks; don't rely on dimension-table WHERE clauses to select the date range.
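For example (asset_status_raw and its columns are assumed names, not from the question):

    -- Keep sort order and statistics current so zone maps stay effective
    VACUUM SORT ONLY asset_status_raw;
    ANALYZE asset_status_raw;

    -- Filter the fact table directly so Redshift can skip blocks,
    -- rather than relying on a join to a date dimension for the range
    SELECT asset_id, status, updated_at
    FROM asset_status_raw
    WHERE updated_at >= DATEADD(day, -30, GETDATE());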

Next, look at your distribution keys and how much data they force your critical query to move across the network. The internode network has the lowest bandwidth in a Redshift cluster, and needlessly pushing lots of data across it will make things slow and inefficient. Using EVEN distribution can be a performance killer, depending on your query pattern.
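As a sketch under the same assumed names, keys that match the hourly query's join and filter pattern keep the work local to each node:

    -- Hypothetical DDL: DISTKEY co-locates each asset's rows on one node,
    -- SORTKEY lets the 30-day filter prune blocks via zone maps
    CREATE TABLE asset_status_raw (
        asset_id   BIGINT,
        city_id    SMALLINT,
        status     VARCHAR(32),
        updated_at TIMESTAMP
    )
    DISTKEY (asset_id)
    SORTKEY (updated_at);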

Now let me get to your question, and let me paraphrase: "is it better to use summary tables, materialized views, or external storage (a Tableau data source) to store summary data updated hourly?" All three work, and each has its own pros and cons.

  1. Summary tables are good because you can select the distribution of the data storage, and if this data needs to be combined with other database tables, that can be done most efficiently. However, there is more data management to be performed to keep this data up to date and in sync.
  2. Materialized views are nice as there is a lot less management action to worry about: when the data changes, just refresh the view. The data is still in the database, so it is easy to combine with other data tables, but since you don't have control over storage of the data, these actions may not be the most efficient. (See the sketch after this list.)
  3. External storage is good in that the data is in your BI tool, so if you need to refetch the results during the hour, the data is local. However, it is locked into your BI tool and far less efficient to combine with other database tables.
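For option 2, which is where I'd land below, a minimal sketch with the same assumed names. Note that a Redshift materialized view definition can't contain mutable functions such as GETDATE(), so the 30-day window is applied when querying the view rather than inside it:

    -- Sketch of option 2; names are assumptions
    CREATE MATERIALIZED VIEW asset_status_summary AS
    SELECT
        asset_id,
        DATE_TRUNC('hour', updated_at) AS status_hour,
        COUNT(*)                       AS update_count
    FROM asset_status_raw
    GROUP BY 1, 2;

    -- Run hourly, e.g. from the Redshift query scheduler
    REFRESH MATERIALIZED VIEW asset_status_summary;

    -- Tableau reads the 30-day slice
    SELECT *
    FROM asset_status_summary
    WHERE status_hour >= DATEADD(day, -30, GETDATE());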

Summary data usually isn't that large, so how it is stored isn't a huge concern, and I'm a bit lazy, so I'd go with a materialized view. But like I said at the beginning, I'd first look at the "slow and inefficient" queries you're running every hour.

Hope this helps

Upvotes: 2
