Reputation: 2310
I'm trying to use Redshift for a Shiny app after using Postgres. So far, Redshift has been outperforming Postgres except when it comes to dashboards.
I would have something like this in global.R
to pull in my data
driver <- dbDriver("PostgreSQL")
conn <- dbConnect(driver,
host="myhost.rds.amazonaws.com",
user="user",
password="password",
port=5432,
dbname="my_database"
)
query = "select * from customers"
res <- dbSendQuery(conn, query))
dataframe <- fetch(res, n = -1))
My approach with Redshift is very similar, except for some tweaks I've taken from here
download.file('https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.12.1017/RedshiftJDBC42-1.2.12.1017.jar','RedshiftJDBC42-1.2.12.1017.jar')
driver <- JDBC("com.amazon.redshift.jdbc42.Driver", "RedshiftJDBC42-1.2.12.1017.jar", identifier.quote="`")
url <- "jdbc:redshift://host.redshift.amazonaws.com:5439/my_db?user=user&password=password"
conn <- dbConnect(driver, url)
query = "select * from customers"
res <- dbSendQuery(conn, query))
dataframe <- fetch(res, n = -1))
However despite being able to execute queries faster when I'm inside a session, it takes considerably longer to fetch the data into a dataframe from Redshift. I haven't found anything addressing this discrepancy. I've ran some tests
> test_time('redshift')
user system elapsed
30.870 7.236 21.423
> test_time('postgres')
user system elapsed
1.877 0.595 2.577
> test_time('redshift')
user system elapsed
31.137 7.294 22.743
> test_time('postgres')
user system elapsed
1.900 0.520 2.466
However I did realize that the Postgres takes alot more time to execute dbSendQuery
while Redshift was alot quicker (but again, alot slower when fetching the results into a dataframe). Here are some metrics I've collected
> test_time('postgres')
user system elapsed # dbSendQuery
1.230 1.201 23.708
user system elapsed # fetch
1.795 0.172 2.029
> test_time('redshift')
user system elapsed # dbSendQuery
0.006 0.004 0.091
user system elapsed # fetch
32.723 8.146 23.666
I just want to understand what's going on behind the hood that would cause Redshift to not perform as well as I expected it to. My assuming would be that it takes more time to convert results from a columnar-based database into a row-based dataframe than it does from a traditional row-based databased to dataframe.
Upvotes: 2
Views: 364
Reputation: 4208
"select star" query is not an columnar OLAP pattern it's expected to perform quickly on. To pull data from a columnar database, the engine needs to get every single column and materialize (turn into tuples) the result set. Conversion that you mentioned actually happens on the database side. It's supposed to work fast when aggregating vast amounts of data and returning a small summary table, that's when Postgres will be much slower. At the same time, data already sits as tuples in Postgres. So, if your primary designation of the database backend is the storage for subsequent analysis in some statistical package like R and Python you probably don't even need Redshift.
Upvotes: 3