Rahul Gupta
Rahul Gupta

Reputation: 1802

How do I use Redshift Database for Transformation and Reporting?

I have 3 tables in my redshift database and data is coming from 3 different csv files from S3 every few seconds. One table has ~3 billion records and other 2 has ~100 million record. For the near realtime reporting purpose, I have to merge this table into 1 table. How do I achieve this in redshift ?

Upvotes: 0

Views: 321

Answers (1)

BigDataKid
BigDataKid

Reputation: 1227

Near Real Time Data Loads in Amazon Redshift

I would say that the first step is to consider whether Redshift is the best platform for the workload you are considering. Redshift is not an optimal platform for streaming data.

Redshift's architecture is better suited for batch inserts than streaming inserts. "COMMIT"s are "costly" in Redshift.

You need to consider the performance impact of VACUUM and ANALYZE if those operations are going to compete for resources with streaming data.

It might still make sense to use Redshift on your project depending on the entire set of requirements and workload, but bear in mind that in order to use Redshift you are going to engineer around it, and probably change your workload from a "near-real-time" to a micro batch architecture.

This blog posts details all the recommendations for micro batch loads in Redshift. Read the Micro-batch article here.

In order to summarize it:

  • Break input files --- Break your load files in several smaller files that are a multiple of the number of slices
  • Column encoding --- Have column encoding pre-defined in your DDL.
  • COPY Settings --- Ensure COPY does not attempt to evaluate the best encoding for each load
  • Load in SORT key order --- If possible your input files should have the same "natural order" as your sort key
  • Staging Tables --- Use multiple staging tables and load them in parallel.
  • Multiple Time Series Tables --- This documented approach for dealing with time-series in Redshift
  • ELT --- Do transformations in-database using SQL to load into the main fact table.

Of course all the recommendations for data loading in Redshift still apply. Look at this article here.

Last but not least, enable Workload Management to ensure the online queries can access the proper amount of resources. Here is an article on how to do it.

Upvotes: 1

Related Questions