Minh

Reputation: 2310

Shiny slow on Redshift

I'm trying to use Redshift for a Shiny app after using Postgres. So far, Redshift has been outperforming Postgres except when it comes to dashboards.

I would have something like this in global.R to pull in my data

    library(RPostgreSQL)

    driver <- dbDriver("PostgreSQL")
    conn <- dbConnect(driver,
                      host = "myhost.rds.amazonaws.com",
                      user = "user",
                      password = "password",
                      port = 5432,
                      dbname = "my_database"
    )
    query <- "select * from customers"
    res <- dbSendQuery(conn, query)
    dataframe <- fetch(res, n = -1)

My approach with Redshift is very similar, except for some tweaks I've taken from here

    library(RJDBC)

    download.file('https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.12.1017/RedshiftJDBC42-1.2.12.1017.jar', 'RedshiftJDBC42-1.2.12.1017.jar')
    driver <- JDBC("com.amazon.redshift.jdbc42.Driver", "RedshiftJDBC42-1.2.12.1017.jar", identifier.quote = "`")
    url <- "jdbc:redshift://host.redshift.amazonaws.com:5439/my_db?user=user&password=password"
    conn <- dbConnect(driver, url)
    query <- "select * from customers"
    res <- dbSendQuery(conn, query)
    dataframe <- fetch(res, n = -1)

However, despite executing queries faster when I'm inside a session, Redshift takes considerably longer to fetch the data into a dataframe. I haven't found anything addressing this discrepancy. I've run some tests:

> test_time('redshift')
   user  system elapsed 
 30.870   7.236  21.423 
> test_time('postgres')
   user  system elapsed 
  1.877   0.595   2.577 

> test_time('redshift')
   user  system elapsed 
 31.137   7.294  22.743 
> test_time('postgres')
   user  system elapsed 
  1.900   0.520   2.466 

However, I did notice that Postgres takes a lot more time to execute dbSendQuery, while Redshift is a lot quicker (but again, a lot slower when fetching the results into a dataframe). Here are some metrics I've collected:

> test_time('postgres')
   user  system elapsed # dbSendQuery
  1.230   1.201  23.708 
   user  system elapsed # fetch
  1.795   0.172   2.029 
> test_time('redshift')
   user  system elapsed # dbSendQuery
  0.006   0.004   0.091 
   user  system elapsed # fetch
 32.723   8.146  23.666 
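The `test_time` helper isn't shown in the question; a minimal sketch that would reproduce the split timings above, assuming an already-open DBI connection `conn`, might look like this:

```r
library(DBI)

# Hypothetical stand-in for the test_time helper used above: times the
# dbSendQuery and fetch phases separately on an already-open connection.
time_phases <- function(conn, query = "select * from customers") {
  send_time  <- system.time(res <- dbSendQuery(conn, query))
  fetch_time <- system.time(df  <- fetch(res, n = -1))
  dbClearResult(res)
  list(send = send_time, fetch = fetch_time, rows = nrow(df))
}
```

Printing `time_phases(conn)$send` and `$fetch` separately makes it easy to see which phase dominates for each backend.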

I just want to understand what's going on under the hood that would cause Redshift to not perform as well as I expected. My assumption would be that it takes more time to convert results from a columnar database into a row-based dataframe than it does from a traditional row-based database.

Upvotes: 2

Views: 364

Answers (1)

AlexYes

Reputation: 4208

A "select star" query is not the columnar OLAP pattern Redshift is expected to perform quickly on. To return every row, the engine needs to read every single column and materialize the result set (turn it back into tuples); the conversion you mentioned actually happens on the database side. Columnar storage is built to be fast when aggregating vast amounts of data and returning a small summary table, and that's where Postgres will be much slower. In Postgres, by contrast, the data already sits as tuples. So if the primary purpose of your database backend is storage for subsequent analysis in a statistical package like R or Python, you probably don't even need Redshift.
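To illustrate the pattern described above, here is a sketch of pushing the aggregation into Redshift and fetching only the small summary (the `region` and `lifetime_value` columns are hypothetical examples, not from the question's schema):

```r
# Aggregate inside Redshift and fetch only the summary table,
# instead of materializing every row with "select *".
# Column names here are illustrative assumptions.
summary_query <- "
  select region,
         count(*)            as n_customers,
         avg(lifetime_value) as avg_ltv
  from customers
  group by region
"

# With an open connection this returns a few rows instead of the whole table:
# summary_df <- DBI::dbGetQuery(conn, summary_query)
```

This plays to Redshift's strengths: only the columns involved in the aggregation are scanned, and the result set transferred to R is tiny.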

Upvotes: 3
