Reputation: 1603
I am working with proc SQL in SAS and one of my proc sql queries is behaving very oddly:
I have a large data set (about 1 Million rows), which looks something like this:
apple_key profit price cost months date
golden_d 0.03 12 4 3 01/12
golden_d 0.03 8 0 2 01/12
granny_s 0.05 15 5 5 02/12
red_d 0.04 13 0 1 01/12
golden_d 0.02 1 2 12 03/14
On this data set I am running the following query:
%let picking_date = 01/12; /* I simplify here - this part of my code definitely works */
proc sql;
CREATE TABLE output AS
SELECT
(CASE apple_key
WHEN "golden_d" THEN 1
WHEN "granny_s" THEN 2
WHEN "red_d" THEN 3
END) AS apple_id,
apple_key AS apple_name,
(CASE WHEN cost= 0 THEN 0
ELSE 1
END) AS cost_flag,
(CASE
WHEN CEIL(months / 2) < 5 THEN CEIL(months / 2)
ELSE 5
END) AS age,
"McDonalds" as farm,
sum(profit*price)/sum(price) as price_weighted_profit
FROM input_table
WHERE date = "&picking_date."d
AND price > cost
AND cost >= 0
AND cost >= 0
GROUP BY apple_id, apple_name, cost_flag, age, farm
;
run;
When I run this my GROUP BY
statement does not work. I get a bunch of entries
for a single group (where apple_id, apple_name, cost_flag, age and farm are all the same, but my aggregation is not working).
However, when I run the GROUP BY separately (as follows) everything works perfectly fine. I get one entry for each group with a "price weighted profit":
proc sql;
CREATE TABLE output_tmp AS
SELECT
(CASE apple_key
WHEN "golden_d" THEN 1
WHEN "granny_s" THEN 2
WHEN "red_d" THEN 3
END) AS apple_id,
apple_key AS apple_name,
(CASE WHEN cost= 0 THEN 0
ELSE 1
END) AS cost_flag,
(CASE
WHEN CEIL(months / 2) < 5 THEN CEIL(months / 2)
ELSE 5
END) AS age,
"McDonalds" as farm
FROM input_table
WHERE date = "&picking_date."d
AND price > cost
AND cost >= 0
AND cost >= 0
;
CREATE TABLE output AS
SELECT
apple_id,
apple_name,
cost_flag,
age,
farm,
sum(profit*price)/sum(price) as price_weighted_profit
FROM output_tmp
GROUP BY apple_id, apple_name, cost_flag, age, farm
;
quit;
Why is this happening? How can I fix it? This is driving me a little bit crazy... Thanks up front for the help
Upvotes: 0
Views: 794
Reputation: 1235
It doesn't work because group by is not taking sum(profit*price)/sum(price) statement as aggregated function. it does not do that because of aliases like age, cost_flag etc.
Anyways below is the right query:-
Proc sql;
CREATE TABLE output AS
SELECT
apple_id,
apple_name,
cost_flag,
age,
farm,
sum(profit*price)/sum(price) as price_weighted_profit
FROM
(
SELECT
(CASE apple_key
WHEN "golden_d" THEN 1
WHEN "granny_s" THEN 2
WHEN "red_d" THEN 3
END) AS apple_id,
apple_key AS apple_name,
(CASE WHEN cost= 0 THEN 0
ELSE 1
END) AS cost_flag,
(CASE
WHEN CEIL(months / 2) < 5 THEN CEIL(months / 2)
ELSE 5
END) AS age,
"McDonalds" as farm
FROM input_table
WHERE date = "&picking_date."d
AND price > cost
AND cost >= 0
AND cost >= 0
) a
GROUP BY apple_id, apple_name, cost_flag, age, farm;
quit;
Let me know if you have any questions
Upvotes: 1
Reputation: 1269483
I suspect that what is happening is remerging. SAS proc sql accepts code like this:
proc sql;
select a.*, count(*)
from a;
This does not summarize the data. Instead it puts the overall count on each row. In other words, if the keys in the select
don't exactly match the group by
, then the aggregation functions are calculated based on the group by
keys, but the results are put onto the individual rows. Other databases do this with a subset of the window functions.
In your case, remerging is not obvious. I think there is key confusion because you are using the same names in the select
as in the original data. My advice is to change the aliases so they are unambiguous and be sure that the keys in the group by
are unambiguous.
Upvotes: 0
Reputation: 13524
Thumb rule :- Whenever you use any aggregate functions in your select clause the remaining columns should be part of the group by. In your question you posted you are apply sum(profit*price)/sum(price) but there is no group by which is causing the problem.
Proc sql;
CREATE TABLE output AS
SELECT
(CASE apple_key
WHEN "golden_d" THEN 1
WHEN "granny_s" THEN 2
WHEN "red_d" THEN 3
END) AS apple_id,
apple_key AS apple_name,
(CASE WHEN cost= 0 THEN 0
ELSE 1
END) AS cost_flag,
(CASE
WHEN CEIL(months / 2) < 5 THEN CEIL(months / 2)
ELSE 5
END) AS age,
"McDonalds" as farm,
sum(profit*price)/sum(price) as price_weighted_profit
FROM input_table
WHERE date = "&picking_date."d
AND price > cost
AND cost >= 0
AND cost >= 0
GROUP BY apple_id, apple_name, cost_flag, age, farm;
quit;
Upvotes: 0