Doug Fir
Doug Fir

Reputation: 21212

Group summary statistics by year based on a date field

I'm aware that those who follow the SAS tag are used to some pretty advanced stuff on this forum. I'm just trying to learn my ABCs right now and am thinking in terms of my experience using SQL or just Excel. I want to know how to group by the year part of a date column. I have done some Google searching and got lost.

If I have a sample data set like this:

Date    Close   Volume
12/31/2014  222.41  2402097
12/30/2014  222.23  2903242
12/29/2014  225.71  2811828
12/26/2014  227.82  3327016
12/24/2014  222.26  1333518
12/23/2014  220.97  4513321
12/22/2014  222.6   4806917
12/19/2014  219.29  6910461
12/18/2014  218.26  7483349
12/17/2014  205.82  7367834
12/16/2014  197.81  8426105
12/15/2014  204.04  5218252
12/12/2014  207 7173782

And this data actually covers two years - 2013 - 14. If I want to group by year, so for 2013 and 2014, how do I, for example:

proc means data=tsla;
// something here to break out Date by 2013 and 2014?
run;  

Gives: enter image description here

Is there a way to tell SAS to break it out by year(Date) or similar syntax?

Similar for:

proc univariate data=tsla; 
var Close Volume; 
run;

Upvotes: 0

Views: 3444

Answers (2)

Joe
Joe

Reputation: 63424

Since you seem to need a bit more explicit detail, I'll explain a few major concepts here that will be helpful.

First off, SAS has two basic types: numeric and character. It has a few derived types, which aren't truly types - they're ways of interpreting numbers. Date is one of those derived types. A date value is stored in a numeric field as the number of days since 1/1/1960; so today is 20117.

SAS then uses formats to convert that underlying number to something visually useful. Excel works exactly the same way, except with a different start date (1/1/1900 for Excel), so that is often helpful to explain. SAS's formats are Excel's Number Formats, same exact idea except a lot more powerful.

Those formats always start with characters, and then end with a period or a period followed by a number. For example, mmddyy10. is the mmddyyw. format, where w.d means width.decimal (width = total number of characters to display it with, decimal = number of digits after the decimal), except mmddyyw. doesn't have decimals for obvious reasons.

SAS also has informats, which are like formats except going the other way: "How do I store this textual value as a number?". So if MMDDYYw. format converts 20117 to 01/29/2015, the MMDDYYw. informat converts 01/29/2015 to 20117. You have to convert if you want to take advantage of SAS's formatting capabilities: SAS can't take 01/29/2015 and figure out how to display it as 2015-01-29, bu it can convert it to 20117 and then display that any number of ways.

So, what Keith is doing in his answer, and what I agree is one good solution (and what I'd do), is using the mmddyy10. informat to convert 01/29/2015 to 20117, and then using the YEAR4. format (YEARw. where w=4) to display any dates that fall in 2014 as 2014, and any date in 2013 as 2013. It's sort of a trick, but it's a good one: it doesn't require creating a new variable, and the dates still have all of the information they used to - they're just displayed on the screen as 2014.

The other thing he's doing is taking advantage of how proc means (and most procs) work. If they have a summarizing/grouping functionality, that functionality works based on the formatted value of the field being grouped by. For proc means, the primary grouping method is class - it basically says 'Summarize by each of the unique class variable values. Works nearly identically togroup by` in SQL. So:

proc means data=<yourdata>;
  class date;
  var close volume;
run;

will summarize close and volume for each unique value of date. Except here that means it will summarize by each day separately - that's not useful!

proc means data=<yourdata>;
  class date;
  var close volume;
  format date YEAR4.;
run;

Ahah. Now, date is summarized by its formatted value by year. So your output will be something like

Date  close volume
2013  15.34 123135342
2014  16.13 151232144

(except a bit more since you're not only getting a single mean value, but also n and stdev and max/min).

This is functionally similar to adding a year column, which might be easier to understand. Say you have a dataset have with this data in it.

data for_means;
  set have;
  year = year(date);
run;

This would store the year of the date (Assuming it was stored as a SAS date, a numeric) in a new column, year. Then you could do:

proc means data=for_means;
  class year;
  var close volume;
run;

But Keith's method - using the format - is easier and faster.

Upvotes: 3

Longfish
Longfish

Reputation: 7602

One advantage with SAS is that it groups by formatted values when using summary procedures. This means that if you format the Date field as a year, it will group by year without having to create an extra column.

data have;
input Date : mmddyy10.  Close   Volume;
format date mmddyy10.;
datalines;
12/31/2014  222.41  2402097
12/30/2014  222.23  2903242
12/29/2014  225.71  2811828
12/26/2014  227.82  3327016
12/24/2014  222.26  1333518
12/23/2014  220.97  4513321
12/22/2014  222.6   4806917
12/19/2014  219.29  6910461
12/18/2014  218.26  7483349
12/17/2014  205.82  7367834
12/16/2014  197.81  8426105
12/15/2014  204.04  5218252
12/12/2014  207 7173782
;
run;

proc means data=have sum mean;
class date;
format date year.;
var close volume;
run;

Upvotes: 4

Related Questions