SQLAlchemy filter out lines that are not distinct across multiple columns

Question

Say I have data on a table as such:

id | other_id | category | amount
--------------------------------
1  | abc      | widget   | 100
2  | abc      | widget   | 200
3  | def      | widget   | 100
4  | ghi      | gadget   | 100
5  | ghi      | gadget   | 100
6  | jkl      | gadget   | 100
7  | jkl      | gadget   | 100

I want to query this table to return

(other_id, category, sum_of_amount)

where sum_of_amount is the sum of the amount column over the all rows in the same other_id. In addition to this, I want to exclude the tuples where the combination of category and sum_of_amount are NOT unique. So the query should return tuples:

(abc, widget, 300)
(def, widget, 100)

and not any of the gadget rows because the combination (gadget, 200) is not unique.

So far, I have this:

with session_scope() as db_session:
  query = db_session.query(
    ModelClass.other_id,
    ModelClass.category,
    label('sum_of_amount', func.sum(ModelClass.amount))
  ).group_by(
    ModelClass.other_id,
    ModelClass.category
  )

This query is not filtering anything out. I think I need to use distinct somehow but I can't figure out.

wsn · Accepted Answer

You can generate the result set of (ohter_id, category, sum_of_amount) with the query as you noted::

=# SELECT other_id, category, SUM(amount) AS sum_of_amount
FROM test
GROUP BY other_id, category;

 other_id │ category │ sum_of_amount
──────────┼──────────┼───────────────
 abc      │ widget   │           300
 ghi      │ gadget   │           200
 jkl      │ gadget   │           200
 def      │ widget   │           100
(4 rows)

And then, you have to exclude rows whose (category, sum_of_amount) are not unique. In above result set to determine every rows' uniqueness, you can add new column containing the number of rows having same (category, sum_of_amount) like below:

=# SELECT other_id, category, SUM(amount) AS sum_of_amount,
COUNT(*) OVER (PARTITION BY category, SUM(amount)) AS duplicates
FROM test
GROUP BY other_id, category;

 other_id │ category │ sum_of_amount │ duplicates
──────────┼──────────┼───────────────┼───────
 ghi      │ gadget   │           200 │     2
 jkl      │ gadget   │           200 │     2
 def      │ widget   │           100 │     1
 abc      │ widget   │           300 │     1
(4 rows)

As you can see in above demonstration, you have determinants in the hands. Now you can generate the result set that you want to find by adding WHERE clause using the duplicates column. Since window functions(the OVER part of duplicates column) are not allowed in WHERE clauses and we have to evaluate the result of window function after sums of amounts have calculated, we have to use subquery.

=# SELECT other_id, category, sum_of_amount
FROM (
  SELECT other_id, category, SUM(amount) AS sum_of_amount,
  COUNT(*) OVER (PARTITION BY category, SUM(amount)) AS duplicates
  FROM test
  GROUP BY other_id, category
) d
WHERE duplicates = 1
ORDER BY other_id, category;

 other_id │ category │ sum_of_amount
──────────┼──────────┼───────────────
 abc      │ widget   │           300
 def      │ widget   │           100
(2 rows)

SQLAlchemy query expression of above SQL could be:

from sqlalchemy import func, over
sum_of_amount = label('sum_of_amount', func.sum(ModelClass.amount))
duplicates = over(func.count('*'),
                  partition_by=(ModelClass.category, sum_of_amount))
query = db_session.query(
    ModelClass.other_id,
    ModelClass.category,
    sum_of_amount,
    duplicates,
  ).group_by(
    ModelClass.other_id,
    ModelClass.category
  ).from_self(
    ModelClass.other_id,
    ModelClass.category,
    sum_of_amount
  ).filter(duplicates == 1)

SQLAlchemy filter out lines that are not distinct across multiple columns

Answers (1)

Related Questions