Reputation: 499
I have a data set with information on students' educations on a institution. I want to get a number of how many different combinations of study programmes they have been on. I have information on both master and bachelor level and I want to count the number of different study programmes in each education level (master, bachelor).
For example person1 can have:
Bachelor:
- study1
- study2
- study3
- study3
Master:
- studyA
- studyA
Then I want a number of 3 study programmes in bachelor level (study3 should not Count twice), and a number of 1 in masters level. Each study programme has its own row - so in the dataset person1 has 6 rows. I want one row per person telling the number of study programmes per education level:
person number_bachelor number_master
person1 3 1
....etc...
I have tried with this:
proc sql;
create table new as
select distinct personid, name,
count(study) as number_of_bach
from old
group by personid, edu_level, study;
quit;
But it doesn't give me what I want. This gives me two rows with person1 with the values of 1 and 2 in the variable "number_of_bach".
How can I edit this code to get the result I want?
Upvotes: 0
Views: 1067
Reputation: 313
Code:
data education;
input person $ level $ program $;
datalines;
person1 bachelor study1
person1 bachelor study2
person1 bachelor study3
person1 bachelor study3
person1 master study1
person2 bachelor study1
person2 master study2
person2 master study1
;
run;
proc sort data = education nodupkey;
by person level program;
run;
proc sql;
select person,
sum(case when level eq 'bachelor' then 1 else 0 end) as num_bachelors,
sum(case when level eq 'bachelor' then 1 else 0 end) as num_masters
from education
group by person;
quit;
Working: Here, SORT procedure will eliminate duplicate records, if any. Then SQL procedure only can be used to generate the person wise count of programs at bachelor level as well as count of programs at master level.
Output:
person num_bachelors num_masters
person1 3 1
person2 1 2
Upvotes: 2
Reputation: 415
Is this what you want?
DATA old;
INPUT personid edu_level $ study $;
DATALINES;
1 bachelor study1
1 bachelor study2
1 bachelor study3
1 bachelor study3
1 master studyA
1 master studyA
1 master studyB
;
PROC SQL;
CREATE TABLE new AS
SELECT personid, edu_level, COUNT (DISTINCT study) AS num_bach
FROM OLD
GROUP BY personid, edu_level;
QUIT;
The column study
is a so-called an aggregate column in your query (because COUNT
is an aggregate function) and as such should not be included in the GROUP BY
-clause (else your query will also groupy by 'study' and the count will always be 1.
If you want to have one each person on one line then add a PROC TRANSPOSE
:
PROC transpose IN = new OUT = new2;
BY personid;
ID edu_level;
RUN;
(You could also create a more complex query using subqueries and joins instead of the transpose, as long as you don't have millions of rows the overhead for the TRANSPOSE
doesn't matter)
For the sake of completeness here is a SQL-only solution to your question:
PROC SQL;
CREATE TABLE new AS
SELECT p.personid, b.num_bachelors, m.num_masters
/* Select unique personids */
FROM (SELECT DISTINCT personid
FROM old) AS p
/* Count number of bachelor-level courses */
LEFT JOIN (SELECT personid,
COUNT(DISTINCT study) AS num_bachelors
FROM old WHERE edu_level = 'bachelor'
GROUP BY personid) AS b on p.personid = b.personid
/* Count number of master-level courses */
LEFT JOIN (SELECT personid,
COUNT(DISTINCT study) AS num_masters
FROM old WHERE edu_level = 'master'
GROUP BY personid) AS m on p.personid = m.personid;
QUIT;
Upvotes: 0