Siva Kg
Siva Kg

Reputation: 59

Creating subgroups while keeping group

I have a SAS dataset, where each row contains information about firms. There is a variable about the sector of the company; sect. There are up to 700 sectors, so I need to group them, like:

Industry: mining, textiles industry, food industry... and so on. I do this with the following code:

data temp; set in.data;

if sect in ('01' '02' '03' '04' '05' '06')    then my_sector='Industry' ;

run;

The tricky part is ; I also want to give out details within each sector. For instance, I want to keep this whole "Industry" sector, but I also want to have subgroups within "Industry". But when I run the following code:

data temp; set in.data;

if sect in ('01' '02' '03' '04' '05' '06')    then my_sector='Industry' ;

if sect in ('01' '02')    then my_sector='Textile Industry' ;
if sect in ('03' '04')    then my_sector='Food Industry' ;
if sect in ('05')         then my_sector='Mining Industry' ;

run;

The subgroups work just fine, but the global "Industry" sector only contains what is not contained in other subgroups (here the sector 06). So mu question is; how do I have the 3 subgroups but also an "Industry" sector that has all the 3 subgroups and the rest of the industry (from 01 to 06)?

Thank you very much,

Upvotes: 1

Views: 85

Answers (2)

Richard
Richard

Reputation: 27498

For the case of

I also want to have subgroups

you will want to map 'sect' into a two variable hierarchy, such as parent, 'industry', and the child, 'sector'.

In general the control of the mapping is best set off into control tables that will let you easily handle to industry and sector names as new or additional 'sect' values come into play.

data sect_mappings;  length sect $2 group $8 sector $8; input
sect   group     sector; datalines;
01     Industry  Textile
02     Industry  Textile
03     Industry  Food
04     Industry  Food
05     Industry  Mining
run;

The control table can be left joined to the original data so as to map the 'sect' value to the 'group' and 'sector' names (i.e. formatted values of sect).

One of the strongest features in SAS is the concept of custom formatting, so you can actually leave the sect value alone, and process it automatically according to the formatted value.

A custom format can be created directly from the mapping table. In your case you want to two custom formats, one for mapping sect to group and the other for sect to sector.

data sector_cntlin;
  set sect_mappings;
  rename sect=start;

  fmtname = '$sect_group'; label=group;  output;
  fmtname = '$sect_name' ; label=sector; output;
run;

proc sort data=sector_cntlin;
  by fmtname;
run;

proc format cntlin=sector_cntlin;
run;

The formats can be used to SAS procedures that use sect as a by or class variable when aggregating information.

Suppose you have sample data:

data have;
  call streaminit(123);
  do company_id = 1 to 100;
    sect = put(ceil(rand('uniform', 10)), z2.);
    output;
  end;
run;

In order to have a two-tier hierarchy (using formatted values), you will need an additional variable that has the sect value repeated.

data want;
  set have;
  sect_repeat = sect;
run;

At this point a SAS procedure process the data in a two-level hierarchy if the sect and sect_repeat variables are the class variables.

proc tabulate data=want;
  class sect sect_repeat;

  format sect $sect_group.;
  format sect_repeat $sect_name.;

  label sect = 'Group';
  label sect_repeat = 'Sector';

  table   
    sect * (all sect_repeat)
  , n
  / nocellmerge
  ;
run;

Outputs a report such as the following. The unmapped values of sect are obvious and can be added to the mapping table, or filtered out with a where statement.

enter image description here

Upvotes: 1

Llex
Llex

Reputation: 1770

After execution first if-statement you overwrite value of variable. You should use other variable or output statement:

data want; set have;
length my_sector $20;

if sect in ('01' '02' '03' '04' '05' '06') then do;
   my_sector='Industry' ;
   output;
   if sect in ('01' '02') then do;
      my_sector='Textile Industry' ;
      output;
   end;
   if sect in ('03' '04') then do;
      my_sector='Food Industry' ;
      output;
   end;
   if sect in ('05') then do;
      my_sector='Mining Industry' ;
      output;
   end;
end;
run;

have dataset:

+------+
| sect |
+------+
|   01 |
|   02 |
|   03 |
|   04 |
|   05 |
|   06 |
+------+

want dataset:

+------+------------------+
| sect |    my_sector     |
+------+------------------+
|   01 | Industry         |
|   01 | Textile Industry |
|   02 | Industry         |
|   02 | Textile Industry |
|   03 | Industry         |
|   03 | Food Industry    |
|   04 | Industry         |
|   04 | Food Industry    |
|   05 | Industry         |
|   05 | Mining Industry  |
|   06 | Industry         |
+------+------------------+

Upvotes: 0

Related Questions