user1008537
user1008537

Reputation:

Why doesn't Redshift support the DOES EXIST correlated sub-query?

RedShift is known to have a set of correlated sub-queries that it can decorrelate and thereby make the query run faster.

Why can't Redshift decorrelate the common DOES NOT EXIST subquery? For example, the query below would be essentially:

SELECT ProductID,
       ProductName
FROM   Products p
WHERE  NOT EXISTS (SELECT *
                   FROM   [Order Details] od
                   WHERE  p.ProductId = od.ProductId) 

Can someone explain why Redshift can't decorrelate this query?

Upvotes: 1

Views: 5797

Answers (4)

Ernesto
Ernesto

Reputation: 609

This functionality IS supported by AWS Redshift. The documentation can be found on the AWS Redshift page here. The below explanation was taken from the aforementioned link.

The syntax is:

[ NOT ] EXISTS (table_subquery)

Where:

  • EXISTS Is true when the table_subquery returns at least one row.
  • NOT EXISTS Is true when the table_subquery returns no rows.
  • table_subquery A subquery that evaluates to a table with one or more columns and one or more rows.

Example:

select dateid from date
where exists (
select 1 from sales
where date.dateid = sales.dateid
)
order by dateid;

dateid
--------
1827
1828
1829
...

Upvotes: 1

Joe Harris
Joe Harris

Reputation: 14035

Redshift was not developed at Amazon but was acquired in exchange for an investment in ParAccel (since acquired by someone else). ParAccel were apparently desperate for cash and sold the family silver to Amazon.

Redshift is a subset of ParAccel's database offering at the time of the investment (presumed to be most of it but never clearly defined). Amazon did not acquire any engineers in this deal - just code.

Early on with Redshift we saw only very shallow changes to the product. It seemed clear that they were struggling to safely make deep changes. They have a solid engineering team now with plenty of knowledge about the code and improvements are rolling out at a decent pace.

All that is actually a long winded way to say that a database engine is an exceptionally complex and expensive piece of software to own. Amazon is notoriously, uh, "parsimonious" and I think the cost of building this team/experience is why we haven't (and won't) see price cuts on the product.

Curt Monash says:
"Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars. That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1."
http://www.dbms2.com/2013/03/18/dbms-development-marklogic-hadoop/

Upvotes: 0

John Rotenstein
John Rotenstein

Reputation: 269320

I have no insight into why that decision was made, but it's probably due to efficiency. Doing a NOT EXISTS means the database has to scan the entire sub-query to find the non-existence of a record. This is horribly inefficient and should be avoided when possible -- especially when dealing with Big Data.

I note that you are retrieving all columns from Orders, where only the ProductId is required. Given that Redshift is a columnar database, it is more efficient to select the minimum number of columns required.

It looks like you're trying to find products that don't exist on any orders, so try:

SELECT
  ProductID,
  ProductName
FROM
  Products
WHERE
  ProductID NOT IN (SELECT DISTINCT ProductId from Orders)

In English, this says "Select any Product that isn't on an Order".

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269743

I don't see anything in the RedShift documentation that precludes this. However, it is easy enough to express as a LEFT JOIN:

SELECT p.*
FROM Products p LEFT JOIN
     [Order Details] od
     ON p.ProductId = od.ProductId
WHERE od.ProductId IS NULL;

As to why Amazon chose particular functionality, you would have to ask their developers or marketers.

Upvotes: 1

Related Questions