Rushil Zutshi
Rushil Zutshi

Reputation: 133

Splitting string data and corresponding data into new row

I have some data that has a string variable (US states), a corresponding integer variable (enrollment) and another string.

Unfortunately, some of the cells under the US states variable have multiple states listed separated by a semi-colon. I'd like to split these up into different rows and then divide the corresponding enrollment equally among those states.

For example, I have:

State       Enrollment   Severity
CA            100          Low
MA;PA         50           Medium
WA;OR;ID      120          High

And I want to be able to transform this into:

State       Enrollment    Severity
CA             100          Low
MA             25           Medium
PA             25           Medium
WA             40           High
OR             40           High
ID             40           High

I have tried separating them using the split command and then (in a convoluted fashion, calculating the corresponding enrollment) but I'm not quite sure how to get them on to new rows even with reshape.


EDIT:

I would also like the solution to be able to handle duplicate States.

For example:

State       Enrollment   Severity
CA            100          Low
MA;CA         50           Medium
WA;CA;ID      120          High

Transformed into:

State       Enrollment    Severity
CA             100          Low
MA             25           Medium
CA             25           Medium
WA             40           High
CA             40           High
ID             40           High

Upvotes: 0

Views: 325

Answers (1)

user8682794
user8682794

Reputation:

Here's one way to do what you want using your original data:

clear 
input str10 State Enrollment str10 Severity
"CA" 100 "Low"
"MA;PA" 50 "Medium"
"WA;OR;ID" 120 "High"
end

generate id = _n
split State, p(;)
drop State
reshape long State, i(State?)
drop State?

keep if State != ""
bysort State (id): egen maxval = max(id)
bysort State (id): generate enrol = Enrollment / maxval
drop Enrollment
rename enrol Enrollment

sort id
drop id _j maxval
order State Enrollment Severity

list, abbreviate(20)

     +-------------------------------+
     | State   Enrollment   Severity |
     |-------------------------------|
  1. |    CA          100        Low |
  2. |    MA           25     Medium |
  3. |    PA           25     Medium |
  4. |    OR           40       High |
  5. |    ID           40       High |
  6. |    WA           40       High |
     +-------------------------------+

EDIT:

Here's one way to do what you want using your revised data:

clear
input str10 State Enrollment str10 Severity
"CA"            100          "Low"
"MA;CA"         50           "Medium"
"WA;CA;ID"      120          "High"
end

generate id = _n
split State, p(;)
drop State

reshape long State, i(id)

keep if State != ""
bysort id: egen maxval = count(id)
bysort id: generate enrol = Enrollment / maxval
drop Enrollment
rename enrol Enrollment

sort id
drop id _j maxval
order State Enrollment Severity

list, abbreviate(20)

     +-------------------------------+
     | State   Enrollment   Severity |
     |-------------------------------|
  1. |    CA          100        Low |
  2. |    MA           25     Medium |
  3. |    CA           25     Medium |
  4. |    WA           40       High |
  5. |    CA           40       High |
  6. |    ID           40       High |
     +-------------------------------+

Upvotes: 2

Related Questions