Reputation: 755
In a PostgreSQL 9.5.3 DB, I have a credit_card_balances
table referencing a persons
table which tracks the balances of various credit cards associated to a particular person:
CREATE TABLE persons (
id serial PRIMARY KEY,
name text
);
CREATE credit_card_balances (
id serial PRIMARY KEY,
card_provider text,
person int REFERENCES persons,
balance decimal,
timestamp timestamp
);
Example row for credit_card_balances
:
id | card_provider | person | balance | timestamp
123 | visa | 1234 | 1.00 | 16-07-26 17:00
I need to retrieve the collection of people who have both a 'visa' and an 'amex' card, such that the most recent balance on the 'visa' card is larger than the most recent balance on the 'amex' card.
For each (person, card_provider)
there may be up to around 100 rows in the table. Ideally, the output columns would be:
person, provider1_balance, provider2_balance, provider1_timestamp, provider2_timestamp
I know that I can do something like
SELECT DISTINCT ON (card_provider) *
FROM credit_card_balances
WHERE person=1234
ORDER BY card_provider, timestamp DESC;
to get the most recent balances for each card for a particular person. But I'm not sure how to do that over all people and verify the conditions above, or if this is even the correct approach.
Edit: AS partially suggested in an answer, I can also do something like
SELECT * from credit_card_balances b1, credit_card_balances b2
WHERE b1.person = b2.person
AND (b1.card_provider = 'amex'
AND b1.timestamp in
(SELECT MAX(time_stamp)
FROM credit_card_balances
WHERE card_provider = 'amex'))
AND (b2.card_provider = 'visa'
AND <... same as above>)
AND b1.balance > b2.balance;
But I noticed that this leads to horrible performance. So I think this isn't such a good option.
Upvotes: 4
Views: 303
Reputation: 657082
This problem is a combination of two classics: greatest-n-per-group and relational-division.
Given your updated specifications and with up to around 100 rows per (person, card_provider)
, I would expect this query to be substantially faster than what we have so far:
SELECT a.person
, a.balance AS amex_balance
, v.balance AS visa_balance
, a.timestamp AS amex_timestamp
, v.timestamp AS visa_timestamp
FROM persons p
CROSS JOIN LATERAL (
SELECT balance, timestamp
FROM credit_card_balances
WHERE person = p.id
AND card_provider = 'amex' -- more selective credit card first to optimize
ORDER BY timestamp DESC
LIMIT 1
) a
JOIN LATERAL (
SELECT balance, timestamp
FROM credit_card_balances
WHERE person = p.id
AND card_provider = 'visa' -- 2nd cc
ORDER BY timestamp DESC
LIMIT 1
) v ON v.balance > a.balance;
Index support is crucial. This would be ideal for the case:
CREATE INDEX ON credit_card_balances (person, card_provider, timestamp DESC, balance);
Adding balance
as last index column only makes sense if you get index-only scans out of it.
This is assuming that timestamp
is defined NOT NULL
, else you may need to add need NULLS LAST
to query and index.
Related:
For only few rows per (person, card_provider)
an approach with DISTINCT ON
might be faster. A separate persons
table wouldn't help. The sweet spot depends on many factors.
Assuming at least several different credit cards.
DISTINCT ON
for one credit card, a LATERAL
subquery for the other:
SELECT a.person
, a.balance AS amex_balance
, v.balance AS visa_balance
, a.timestamp AS amex_timestamp
, v.timestamp AS visa_timestamp
FROM (
SELECT DISTINCT ON (person)
person, balance, timestamp
FROM credit_card_balances
WHERE card_provider = 'amex' -- the more selective credit card first
ORDER BY person, timestamp DESC
) a
JOIN LATERAL (
SELECT balance, timestamp
FROM credit_card_balances
WHERE card_provider = 'visa'
AND person = a.person
ORDER BY timestamp DESC
LIMIT 1
) v ON v.balance > a.balance
DISTINCT ON
for each credit card, then join:
SELECT a.person
, a.balance AS amex_balance
, v.balance AS visa_balance
, a.timestamp AS amex_timestamp
, v.timestamp AS visa_timestamp
FROM (
SELECT DISTINCT ON (person)
person, balance, timestamp
FROM credit_card_balances
WHERE card_provider = 'amex'
ORDER BY person, timestamp DESC
) a
JOIN (
SELECT DISTINCT ON (person)
person, balance, timestamp
FROM credit_card_balances
WHERE card_provider = 'visa'
ORDER BY person, timestamp DESC
) v USING (person)
WHERE v.balance > a.balance;
Or, my favorite: one DISTINCT ON
for both credit cards, then filtered aggregate with a HAVING
condition:
SELECT person
, max(balance) FILTER (WHERE card_provider = 'amex') AS amex_balance
, max(balance) FILTER (WHERE card_provider = 'visa') AS visa_balance
, max(timestamp) FILTER (WHERE card_provider = 'amex') AS amex_timestamp
, max(timestamp) FILTER (WHERE card_provider = 'visa') AS visa_timestamp
FROM (
SELECT DISTINCT ON (person, card_provider)
person, card_provider, balance, timestamp
FROM credit_card_balances
WHERE card_provider IN ('amex', 'visa')
ORDER BY person, card_provider, timestamp DESC
) c
GROUP BY person
HAVING max(balance) FILTER (WHERE card_provider = 'visa')
> max(balance) FILTER (WHERE card_provider = 'amex');
The aggregate FILTER
clause requires Postgres 9.4+:
Upvotes: 3
Reputation: 759
You can do it with helf of nested select and window function
select * from (
select *,
rank() over(partition by card_provider order by balance desc) as rank
from credit_card_balances
) credit_card_balances_ranked
where rank = 1
Upvotes: 0
Reputation: 1396
Use a self join. Something like:
SELECT * from credit_card_balances b1, credit_card_balances b2
WHERE b1.person = b2.person
AND b1.card_provider = 'amex'
AND b2.card_provider = 'visa'
AND b1.balance > b2.balance;
Combining this with more or less what you already came up with results in the following, using a view to keep the query easier to understand.
CREATE VIEW most_recent_balance AS
SELECT DISTINCT ON (person, card_provider) *
FROM credit_card_balances
GROUP BY id, person
ORDER BY person, card_provider, timestamp DESC;
Substitute this most_recent_balance view in place of the table in the self join query.
Upvotes: 0