Reputation: 95

If - Then with multiple characters and conditions

I hope someone can help me, as my current approach with grepl does not produce anything that works.

I have several categories (stored as character strings). I now want to build variables that take different values depending on the category.

The data looks like the following:

category                                 

Candidate Biography                        
Candidate Biography                         
Candidate Biography                         
Candidate Biography, Campaign Finance       
Justice, Candidate Biography, Economy       
Candidate Biography, Jobs                   
Economy, Education, Candidate Biography    
Economy, Civil Rights, Candidate Biography

Now I want to create new variables that take different values according to the category, as shown below:

category                                 CandBio   Economy  CivilRights   Family
Candidate Biography                         1         0          0           0
Candidate Biography                         1         0          0           0
Candidate Biography                         1         0          0           0
Candidate Biography, Campaign Finance       0.5       0.5        0           0
Justice, Candidate Biography, Economy       0.33      0.33       0.33        0
Candidate Biography, Jobs                   0.5       0.5        0           0
Economy, Education, Candidate Biography     0.33      0.33       0           0.33
Economy, Civil Rights, Candidate Biography  0.33      0.33       0.33        0

Each category has a specific weight for each variable (and can load on several variables). E.g. "Candidate Biography, Campaign Finance" loads on CandBio and Economy with 0.5 each. Categories recur across many observations in the dataset (in total 49k observations with 120 different categories, which need to be aggregated into 10 variables such as CandBio, Economy and CivilRights in the example).

I first tried combining ifelse and grepl, but I realized that grepl is very sensitive to order and that I can get faulty categorizations for each category depending on how I structure my ifelse. I also tried to build vectors of all category terms that share the same weight and to pass those vectors to grepl, but that didn't work either.

So I am looking for any solution that helps me assign my weights to the variables depending on the category text.

I hope I have described my problem clearly, and I am looking forward to any help; it is very much appreciated! Many thanks in advance!

EDIT: So far I have tried it this way, but with no success:

clintontvad$CandidateBiography <- ifelse(ifelse(grepl("Candidate Biography", clintontvad$subjects),1,
                                                ifelse(grepl("Candidate Biography, Marriage, Gays and Lesbians, Civil Rights, Immigration, Trade, Energy, Workers", clintontvad$subjects), 0.125, 
                                                ifelse(grepl("Candidate Biography, Terrorism, Islam, Foreign Policy, Nuclear, Iran", clintontvad$subjects),0.17,
                                                ifelse(grepl("Children, Candidate Biography, Families, Education, Debt, Economy, Jobs", clintontvad$subjects),0.17,
                                                       ifelse(grepl("Candidate Biography, Children, Education, Health Care, Women", clintontvad$subjects), 0.2,
                                                              ifelse(grepl("Candidate Biography, Civil Rights, Islam, Gays and Lesbians, Women", clintontvad$subjects), 0.2,
                                                                     ifelse(grepl("Candidate Biography, Economy, Election, Children, Families", clintontvad$subjects), 0.2,
                                                                            ifelse(grepl("Children, Education, Women, Economy, Families", clintontvad$subjects), 0.2,
                                                                                   ifelse(grepl("Job Accomplishments, Abortion, Women, Health Care, Climate Change, Marriage", clintontvad$subjects), 0.2,
                                                                                          ifelse(grepl("Women, Civil Rights, Gays and Lesbians, Foreign Policy, Canddate Biography", clintontvad$subjects), 0.25, 
                                                                                                 ifelse(grepl("Poverty, Health Care, Candidate Biography, Terrorism", clintontvad$subjects), 0.25,
                                                                                                        ifelse(grepl("Job Accomplishments, Foreign Policy, Health Care, Children", clintontvad$subjects), 0.25,
                                                                                                               ifelse(grepl("Foreign Policy, Terrorism, Candidate Biography", clintontvad$subjects),0.25,
                                                                                                                      ifelse(grepl("Ethics, Terrorism, Candidate Biography", clintontvad$subjects),0.25, 0)))))))))))))
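
The ordering problem can be reproduced in a few lines. This is only a minimal sketch with two made-up values in `subjects` (not the real `clintontvad` data): because grepl does substring matching, the short pattern "Candidate Biography" also matches every longer string that contains it, so when it comes first in the nested ifelse the more specific branches are never reached.

```r
# Toy stand-in for clintontvad$subjects (values are illustrative only)
subjects <- c("Candidate Biography",
              "Foreign Policy, Terrorism, Candidate Biography")

# Short pattern first: it matches both rows, so the 0.25 branch is never used
short_first <- ifelse(grepl("Candidate Biography", subjects, fixed = TRUE), 1,
               ifelse(grepl("Foreign Policy, Terrorism, Candidate Biography",
                            subjects, fixed = TRUE), 0.25, 0))

# Most specific pattern first: only now can the second row receive 0.25
specific_first <- ifelse(grepl("Foreign Policy, Terrorism, Candidate Biography",
                               subjects, fixed = TRUE), 0.25,
                  ifelse(grepl("Candidate Biography", subjects, fixed = TRUE), 1, 0))

print(short_first)
print(specific_first)
```

With 120 category strings, keeping such a chain in the right order by hand is fragile, which is why splitting the string into its individual tags tends to be easier to maintain.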

Upvotes: 0

Answers (2)

Andrew

Reputation: 5138

If I understood correctly, here is one way to do it. You need a vector of matches for each category, and you will want to keep an eye on case and on any special characters. But this should get you started. Let me know if you have any issues with it. Also, in hindsight, I named too many things "category", but you should get the idea: category1, category2 and category3 refer to whatever makes up your broader groups (e.g., Economy and CivilRights). Last, if this is slow, it will probably be a lot faster to use a function from stringi instead of grepl. I can post an edit if this base solution is too slow.

# Example dataframe
df <- data.frame(category = c("cat 1a",
                        "cat 1a",
                        "cat 1a",
                        "cat 1a, cat 2a",
                        "cat 3a, cat 1a, cat 2b",
                        "cat 1a, cat 2c"),
                 stringsAsFactors = FALSE)

# Create a list with strings split based on the comma
string_list <- strsplit(df$category, split = ",", fixed = TRUE)

# Pre defined categories
category1 <- c("cat 1a", "cat 1b", "cat 1c")
category2 <- c("cat 2a", "cat 2b", "cat 2c")
category3 <- c("cat 3a", "cat 3b", "cat 3c")

# Create new columns based on your categories:
# 1 / (number of tags in the row) if any tag belongs to the group, otherwise 0
df$Category_1 <- sapply(string_list, function(x) any(grepl(paste(category1, collapse = "|"), x)) / length(x))
df$Category_2 <- sapply(string_list, function(x) any(grepl(paste(category2, collapse = "|"), x)) / length(x))
df$Category_3 <- sapply(string_list, function(x) any(grepl(paste(category3, collapse = "|"), x)) / length(x))

df
                category Category_1 Category_2 Category_3
1                 cat 1a  1.0000000  0.0000000  0.0000000
2                 cat 1a  1.0000000  0.0000000  0.0000000
3                 cat 1a  1.0000000  0.0000000  0.0000000
4         cat 1a, cat 2a  0.5000000  0.5000000  0.0000000
5 cat 3a, cat 1a, cat 2b  0.3333333  0.3333333  0.3333333
6         cat 1a, cat 2c  0.5000000  0.5000000  0.0000000
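
If case or special characters turn out to be a problem, one option is to avoid regular expressions altogether: trim the whitespace left over from the split and compare whole tags with `%in%`. The following is a sketch of that variant under the same toy data, not a replacement for the code above; the `groups` list and the `weight_for` helper are names made up for illustration.

```r
df <- data.frame(category = c("cat 1a",
                              "cat 1a, cat 2a",
                              "cat 3a, cat 1a, cat 2b"),
                 stringsAsFactors = FALSE)

# Split on commas and strip the leading space left on every tag after the first
tags <- lapply(strsplit(df$category, split = ",", fixed = TRUE), trimws)

# Hypothetical grouping, same content as category1, category2 and category3
groups <- list(Category_1 = c("cat 1a", "cat 1b", "cat 1c"),
               Category_2 = c("cat 2a", "cat 2b", "cat 2c"),
               Category_3 = c("cat 3a", "cat 3b", "cat 3c"))

# Weight = 1 / (number of tags in the row) if any tag is in the group, else 0
weight_for <- function(members) {
  sapply(tags, function(x) any(x %in% members) / length(x))
}

for (g in names(groups)) df[[g]] <- weight_for(groups[[g]])
print(df)
```

Because `%in%` compares complete strings, a tag such as "cat 1ab" would not be counted as "cat 1a", which a regex substring match would do.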

EDIT: using the data that @Gilean0709 kindly provided (and stringi, to make it faster), here is an update:

# Example dataframe
df <- data.frame(category = c("Candidate Biography", "Candidate Biography", "Candidate Biography", 
                             "Candidate Biography, Campaign Finance", 
                             "Justice, Candidate Biography, Economy", "Candidate Biography, Jobs", 
                             "Economy, Education, Candidate Biography", 
                             "Economy, Civil Rights, Candidate Biography"), stringsAsFactors = FALSE)


# Create a list with strings split based on the comma
string_list <- strsplit(df$category, split = ",", fixed = TRUE)

library(stringi)
library(dplyr)  # provides %>% and mutate_if, used for the rounded printout at the end

# Pre defined categories
CandBio <- paste(c("Candidate Biography"), collapse = "|")
Economy <- paste(c("Campaign Finance", "Economy", "Jobs"), collapse = "|")
CivilRights <- paste(c("Justice", "Civil Rights"), collapse = "|")
Family <- paste(c("Education"), collapse = "|")

# Create new columns based on your categories:
# 1 / (number of tags in the row) if any tag matches the pattern, otherwise 0
df$CandBio <- sapply(string_list, function(x) any(stri_detect_regex(x, CandBio)) / length(x))
df$Economy <- sapply(string_list, function(x) any(stri_detect_regex(x, Economy)) / length(x))
df$CivilRights <- sapply(string_list, function(x) any(stri_detect_regex(x, CivilRights)) / length(x))
df$Family <- sapply(string_list, function(x) any(stri_detect_regex(x, Family)) / length(x))

df %>%
  mutate_if(is.numeric, round, digits = 2)
                                    category CandBio Economy CivilRights Family
1                        Candidate Biography    1.00    0.00        0.00   0.00
2                        Candidate Biography    1.00    0.00        0.00   0.00
3                        Candidate Biography    1.00    0.00        0.00   0.00
4      Candidate Biography, Campaign Finance    0.50    0.50        0.00   0.00
5      Justice, Candidate Biography, Economy    0.33    0.33        0.33   0.00
6                  Candidate Biography, Jobs    0.50    0.50        0.00   0.00
7    Economy, Education, Candidate Biography    0.33    0.33        0.00   0.33
8 Economy, Civil Rights, Candidate Biography    0.33    0.33        0.33   0.00

Upvotes: 0

Gilean0709

Reputation: 1098

If I understood your example correctly, the weights for the new variables depend on the number of categories in each row. In that case you can use a two-step approach: first create your new variables, then divide by the number of matched categories.

d <- data.frame(category = c("Candidate Biography", "Candidate Biography", "Candidate Biography", 
                             "Candidate Biography, Campaign Finance", 
                             "Justice, Candidate Biography, Economy", "Candidate Biography, Jobs", 
                             "Economy, Education, Candidate Biography", 
                             "Economy, Civil Rights, Candidate Biography"))

# create a list with all your new variables and their respective categories
categories <- list(
  CandBio = c("Candidate Biography"),   
  Economy = c("Campaign Finance", "Economy", "Jobs"), 
  CivilRights = c("Justice", "Civil Rights"), 
  Family = c("Education")
  )

# create the new variables
for (i in seq_along(categories)) {
  d[names(categories)[i]] <- grepl(paste0(categories[[i]], collapse = "|"), d[, "category"])
}

# divide by number of matched categories
d[, -1] <- d[, -1]/rowSums(d[, -1])

d
                                    category   CandBio   Economy CivilRights    Family
1                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
2                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
3                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
4      Candidate Biography, Campaign Finance 0.5000000 0.5000000   0.0000000 0.0000000
5      Justice, Candidate Biography, Economy 0.3333333 0.3333333   0.3333333 0.0000000
6                  Candidate Biography, Jobs 0.5000000 0.5000000   0.0000000 0.0000000
7    Economy, Education, Candidate Biography 0.3333333 0.3333333   0.0000000 0.3333333
8 Economy, Civil Rights, Candidate Biography 0.3333333 0.3333333   0.3333333 0.0000000
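
Two details of this approach are worth checking against the real data. The denominator is the number of matched variables, not the number of comma-separated tags, so a row whose tags all fall into one variable gets a weight of 1 there (dividing by the tag count, as in the other answer, would give 0.5 for "Economy, Jobs"). And a row that matches none of the variables produces 0/0, i.e. NaN. Here is a small sketch with two made-up rows; the tag "Weather" is invented for the example, and the `pmax` guard is one possible way to keep unmatched rows at 0.

```r
d <- data.frame(category = c("Economy, Jobs",   # two tags, both in Economy
                             "Weather"),        # made-up tag, matches nothing
                stringsAsFactors = FALSE)

categories <- list(
  CandBio = c("Candidate Biography"),
  Economy = c("Campaign Finance", "Economy", "Jobs")
)

# same loop as above: one logical column per new variable
for (i in seq_along(categories)) {
  d[names(categories)[i]] <- grepl(paste0(categories[[i]], collapse = "|"), d[, "category"])
}

# Guard against rows with no match: divide by 1 instead of 0 so they stay 0
n_matched <- rowSums(d[, -1])
d[, -1] <- d[, -1] / pmax(n_matched, 1)
print(d)
```

Whether "Economy, Jobs" should load 1 or 0.5 on Economy depends on how the weights are meant to be defined, so it is worth deciding that before picking either answer.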

Upvotes: 1
