Reputation: 703
In my data.frame I have three columns on the SUBJECT of a row. I want an additional column with a unique subject for each row. First, how my data looks like:
DATE <- c("1","2","3","4","5","6","7","1","2","3","4","5","6","7")
COMP <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
CLASS <- c("positive", "negative", "aneutral", "positive", "positive", "negative", "aneutral", "positive", "negative", "negative", "positive", "aneutral", "aneutral", "aneutral")
SUBJECT.1 <- c("LITIGATION","LAYOFF","POLLUTION","CHEMICAL DISASTER","PRESS RELEASE","PEOPLE","EMISSIONS","ENERGY","WASTE MANAGEMENT","EMPLOYEES","MANAGEMENT","PRESS RELEASE","HOTELS","POLLUTION")
SUBJECT.2 <- c("POLLUTION","EMPLOYEES","NUCLEAR","FUELS","STOCK OPTION PLAN","EXECUTIVES","CO2","SOLAR","POLLUTION","EXECUTIVES","PRESS RELEASE","CELEBRITIES","CELEBRITIES","LITIGATION")
SUBJECT.3 <- c("ENVIRONMENT","JOB REDUCTIONS","POWER PLANTS","POLLUTION","EMPLOYEES","FRAUD","CLIMATE CHANGE","SUSTAINABILITY","HAZARDOUS WASTE","BONUS PAY","LITIGATION","EMISSIONS","SCANDALS","SCANDALS")
CONTROLVAR <- c("11","13","13","14","13","14","12","11","13","13","14","13","14","12")
mydf <- data.frame(DATE, COMP, RET, CLASS, SUBJECT.1, SUBJECT.2, SUBJECT.3, CONTROLVAR, stringsAsFactors=F)
mydf
# DATE COMP RET CLASS SUBJECT.1 SUBJECT.2 SUBJECT.3 CONTROLVAR
# 1 1 A -2.00 positive LITIGATION POLLUTION ENVIRONMENT 11
# 2 2 A 1.10 negative LAYOFF EMPLOYEES JOB REDUCTIONS 13
# 3 3 A 3.00 aneutral POLLUTION NUCLEAR POWER PLANTS 13
# 4 4 A 1.40 positive CHEMICAL DISASTER FUELS POLLUTION 14
# 5 5 A -0.20 positive PRESS RELEASE STOCK OPTION PLAN EMPLOYEES 13
# 6 6 A 0.60 negative PEOPLE EXECUTIVES FRAUD 14
# 7 7 A 0.10 aneutral EMISSIONS CO2 CLIMATE CHANGE 12
# 8 1 B -0.21 positive ENERGY SOLAR SUSTAINABILITY 11
# 9 2 B -1.20 negative WASTE MANAGEMENT POLLUTION HAZARDOUS WASTE 13
# 10 3 B 0.90 negative EMPLOYEES EXECUTIVES BONUS PAY 13
# 11 4 B 0.30 positive MANAGEMENT PRESS RELEASE LITIGATION 14
# 12 5 B -0.10 aneutral PRESS RELEASE CELEBRITIES EMISSIONS 13
# 13 6 B 0.30 aneutral HOTELS CELEBRITIES SCANDALS 14
# 14 7 B -0.12 aneutral POLLUTION LITIGATION SCANDALS 12
Since I want to include the subject as a dummy variable (which should be exclusive) for a later regression, I want a single column SUBJECT with a unique subject for each row. I'd like to focus on the subjects LITIGATION, POLLUTION and LAYOFF.
I want to go from left to right and check each SUBJECT column for LITIGATION, POLLUTION and LAYOFF.
If there is one of the three subjects LITIGATION, POLLUTION or LAYOFF in the first column, this subject is taken. If there is a different subject in the first column, I check the second column and so on. If none of the three subject-columns contain LITIGATION, POLLUTION or LAYOFF, then the subject should be called OTHER. Also, some of the subjects should be grouped. In this example EMISSIONS should be treated as if it was POLLUTION.
The output should look like this:
# DATE COMP RET CLASS SUBJECT.1 SUBJECT.2 SUBJECT.3 SUBJECT CONTROLVAR
# 1 1 A -2.00 positive LITIGATION POLLUTION ENVIRONMENT LITIGATION 11
# 2 2 A 1.10 negative LAYOFF EMPLOYEES JOB REDUCTIONS LAYOFF 13
# 3 3 A 3.00 aneutral POLLUTION NUCLEAR POWER PLANTS POLLUTION 13
# 4 4 A 1.40 positive CHEMICAL DISASTER FUELS POLLUTION POLLUTION 14
# 5 5 A -0.20 positive PRESS RELEASE STOCK OPTION PLAN EMPLOYEES OTHER 13
# 6 6 A 0.60 negative PEOPLE EXECUTIVES FRAUD OTHER 14
# 7 7 A 0.10 aneutral EMISSIONS CO2 CLIMATE CHANGE POLLUTION 12
# 8 1 B -0.21 positive ENERGY SOLAR SUSTAINABILITY OTHER 11
# 9 2 B -1.20 negative WASTE MANAGEMENT POLLUTION HAZARDOUS WASTE POLLUTION 13
# 10 3 B 0.90 negative EMPLOYEES EXECUTIVES BONUS PAY OTHER 13
# 11 4 B 0.30 positive MANAGEMENT PRESS RELEASE LITIGATION LITIGATION 14
# 12 5 B -0.10 aneutral PRESS RELEASE CELEBRITIES EMISSIONS POLLUTION 13
# 13 6 B 0.30 aneutral HOTELS CELEBRITIES SCANDALS OTHER 14
# 14 7 B -0.12 aneutral POLLUTION LITIGATION SCANDALS POLLUTION 12
Thank you!
Upvotes: 0
Views: 148
Reputation: 363
here is a couple of line solution which involves just chaining together 3 ifelse()
.
important <- c('LITIGATION','POLLUTION','LAYOFF','EMISSIONS')
ifelse( mydf$SUBJECT.1 %in% important, mydf$SUBJECT.1,
ifelse( mydf$SUBJECT.2 %in% important, mydf$SUBJECT.2,
ifelse( mydf$SUBJECT.3 %in% important, mydf$SUBJECT.3,'OTHER')))
mydf$SUBJECT[mydf$SUBJECT=='EMISSIONS'] <- 'POLUTION'
Upvotes: 0
Reputation: 92310
mydf$SUBJECT <- "OTHER"
sapply(c("SUBJECT.3", "SUBJECT.2", "SUBJECT.1"), function(x) mydf[mydf[, x] %in% c("LITIGATION", "POLLUTION", "LAYOFF", "EMISSIONS"), "SUBJECT"] <<- mydf[mydf[, x] %in% c("LITIGATION", "POLLUTION", "LAYOFF", "EMISSIONS"), x])
mydf$SUBJECT[mydf$SUBJECT == "EMISSIONS"] <- "POLLUTION"
Upvotes: 2