Reputation: 95
I hope someone can help me, as my current approach with grepl has not produced anything that works.
I have several categories (stored as characters). I now want to build a variable that takes different values for different categories.
The data looks like the following
category
Candidate Biography
Candidate Biography
Candidate Biography
Candidate Biography, Campaign Finance
Justice, Candidate Biography, Economy
Candidate Biography, Jobs
Economy, Education, Candidate Biography
Economy, Civil Rights, Candidate Biography
Now I want to create new variables that take different values according to the category, as shown below:
category                                    CandBio  Economy  CivilRights  Family
Candidate Biography                         1        0        0            0
Candidate Biography                         1        0        0            0
Candidate Biography                         1        0        0            0
Candidate Biography, Campaign Finance       0.5      0.5      0            0
Justice, Candidate Biography, Economy       0.33     0.33     0.33         0
Candidate Biography, Jobs                   0.5      0.5      0            0
Economy, Education, Candidate Biography     0.33     0.33     0            0.33
Economy, Civil Rights, Candidate Biography  0.33     0.33     0.33         0
Each category has a specific weight for each variable (and can load on several variables). For example, "Candidate Biography, Campaign Finance" loads on CandBio and Economy with 0.5 each. Categories recur across many observations in the dataset: in total there are 49k observations with 120 different categories that need to be aggregated into 10 variables (CandBio, Economy, CivilRights, etc. in the example).
I first tried combining ifelse and grepl, but I realized that grepl is very sensitive to order and that I can get faulty categorizations for each category depending on how I structure my ifelse. I also tried to build vectors of all category terms that share the same weight and to pass each vector to grepl, but that didn't work either.
So I am looking for any solution that helps me assign my weights to the variables depending on the category text.
I hope I could clearly describe my problem and I am looking forward to any help, that is very much appreciated! Many thanks beforehand!
EDIT: So far I tried it this way, but with no success:
clintontvad$CandidateBiography <- ifelse(ifelse(grepl("Candidate Biography", clintontvad$subjects),1,
ifelse(grepl("Candidate Biography, Marriage, Gays and Lesbians, Civil Rights, Immigration, Trade, Energy, Workers", clintontvad$subjects), 0.125,
ifelse(grepl("Candidate Biography, Terrorism, Islam, Foreign Policy, Nuclear, Iran", clintontvad$subjects),0.17,
ifelse(grepl("Children, Candidate Biography, Families, Education, Debt, Economy, Jobs", clintontvad$subjects),0.17,
ifelse(grepl("Candidate Biography, Children, Education, Health Care, Women", clintontvad$subjects), 0.2,
ifelse(grepl("Candidate Biography, Civil Rights, Islam, Gays and Lesbians, Women", clintontvad$subjects), 0.2,
ifelse(grepl("Candidate Biography, Economy, Election, Children, Families", clintontvad$subjects), 0.2,
ifelse(grepl("Children, Education, Women, Economy, Families", clintontvad$subjects), 0.2,
ifelse(grepl("Job Accomplishments, Abortion, Women, Health Care, Climate Change, Marriage", clintontvad$subjects), 0.2,
ifelse(grepl("Women, Civil Rights, Gays and Lesbians, Foreign Policy, Canddate Biography", clintontvad$subjects), 0.25,
ifelse(grepl("Poverty, Health Care, Candidate Biography, Terrorism", clintontvad$subjects), 0.25,
ifelse(grepl("Job Accomplishments, Foreign Policy, Health Care, Children", clintontvad$subjects), 0.25,
ifelse(grepl("Foreign Policy, Terrorism, Candidate Biography", clintontvad$subjects),0.25,
ifelse(grepl("Ethics, Terrorism, Candidate Biography", clintontvad$subjects),0.25, 0)))))))))))))
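A minimal sketch of why this kind of nesting is order-sensitive, using two made-up strings (`subjects` here is a hypothetical stand-in for `clintontvad$subjects`): the short pattern "Candidate Biography" is a substring of every longer combination, so when it is tested first, the more specific branches are never reached.

```r
# Hypothetical mini-example; `subjects` stands in for clintontvad$subjects
subjects <- c("Candidate Biography",
              "Foreign Policy, Terrorism, Candidate Biography")

# Short pattern first: it matches BOTH strings, so the specific branch is dead
short_first <- ifelse(grepl("Candidate Biography", subjects, fixed = TRUE), 1,
               ifelse(grepl("Foreign Policy, Terrorism, Candidate Biography",
                            subjects, fixed = TRUE), 0.25, 0))

# Longer, more specific pattern first: each string gets its own weight
long_first <- ifelse(grepl("Foreign Policy, Terrorism, Candidate Biography",
                           subjects, fixed = TRUE), 0.25,
              ifelse(grepl("Candidate Biography", subjects, fixed = TRUE), 1, 0))

short_first  # both elements are 1
long_first   # 1 for the first string, 0.25 for the second
```

Reordering fixes this toy case, but with 120 combinations the ordering becomes hard to maintain, which is why splitting the string into its items is usually easier.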
Upvotes: 0
Views: 409
Reputation: 5138
As far as I understand, here is one way to do it. You need a vector of matches for each category, and you will want to keep an eye on letter case and on any special characters. But this should get you started. Let me know if you have any issues with it. Also, in hindsight, I named too many things "category", but you should get the idea: category1, category2, and category3 refer to whatever makes up your broader groups (e.g., Economy and CivilRights). Last, if this is slow, it will probably be a lot faster to use a function from stringi instead of grepl. I can post an edit if this base solution is too slow.
# Example dataframe
df <- data.frame(category = c("cat 1a",
                              "cat 1a",
                              "cat 1a",
                              "cat 1a, cat 2a",
                              "cat 3a, cat 1a, cat 2b",
                              "cat 1a, cat 2c"),
                 stringsAsFactors = FALSE)
# Create a list with strings split based on the comma
string_list <- strsplit(df$category, split = ",", fixed = TRUE)
# Pre defined categories
category1 <- c("cat 1a", "cat 1b", "cat 1c")
category2 <- c("cat 2a", "cat 2b", "cat 2c")
category3 <- c("cat 3a", "cat 3b", "cat 3c")
# Create new columns based on your categories:
# 1 / (number of items in the row) if any item matches the group, otherwise 0
df$Category_1 <- sapply(seq_along(string_list), function(x)
  any(grepl(paste(category1, collapse = "|"), string_list[[x]])) /
    length(string_list[[x]]))
df$Category_2 <- sapply(seq_along(string_list), function(x)
  any(grepl(paste(category2, collapse = "|"), string_list[[x]])) /
    length(string_list[[x]]))
df$Category_3 <- sapply(seq_along(string_list), function(x)
  any(grepl(paste(category3, collapse = "|"), string_list[[x]])) /
    length(string_list[[x]]))
df
                category Category_1 Category_2 Category_3
1                 cat 1a  1.0000000  0.0000000  0.0000000
2                 cat 1a  1.0000000  0.0000000  0.0000000
3                 cat 1a  1.0000000  0.0000000  0.0000000
4         cat 1a, cat 2a  0.5000000  0.5000000  0.0000000
5 cat 3a, cat 1a, cat 2b  0.3333333  0.3333333  0.3333333
6         cat 1a, cat 2c  0.5000000  0.5000000  0.0000000
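The three near-identical sapply calls above could probably be folded into a loop over a named list. The sketch below is one way to do that, not the only one: `groups`, `items`, and `group_weight()` are names made up for illustration, and it swaps the regex match for an exact `%in%` match on whitespace-trimmed items, which assumes the category labels are spelled consistently.

```r
# Groups as a named list; the names become the new column names
groups <- list(Category_1 = c("cat 1a", "cat 1b", "cat 1c"),
               Category_2 = c("cat 2a", "cat 2b", "cat 2c"),
               Category_3 = c("cat 3a", "cat 3b", "cat 3c"))

df <- data.frame(category = c("cat 1a",
                              "cat 1a, cat 2a",
                              "cat 3a, cat 1a, cat 2b"),
                 stringsAsFactors = FALSE)

# Split on commas and trim the whitespace left around each item
items <- lapply(strsplit(df$category, ",", fixed = TRUE), trimws)

# Hypothetical helper: 1 / (number of items in the row) if any item
# belongs to the group, otherwise 0
group_weight <- function(x, members) any(x %in% members) / length(x)

for (g in names(groups)) {
  df[[g]] <- vapply(items, group_weight, numeric(1), members = groups[[g]])
}
df
```

Exact matching avoids accidental substring hits (for example, a pattern such as "Economy" matching inside a longer label), at the cost of being stricter about typos in the data.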
EDIT: using the data that @Gilean0709 kindly provided (and stringi, to make it faster), here is an update:
# Example dataframe
df <- data.frame(category = c("Candidate Biography", "Candidate Biography", "Candidate Biography",
                              "Candidate Biography, Campaign Finance",
                              "Justice, Candidate Biography, Economy", "Candidate Biography, Jobs",
                              "Economy, Education, Candidate Biography",
                              "Economy, Civil Rights, Candidate Biography"),
                 stringsAsFactors = FALSE)
# Create a list with strings split based on the comma
string_list <- strsplit(df$category, split = ",", fixed = TRUE)
library(stringi)
library(dplyr)  # provides %>% and mutate_if(), used for rounding at the end
# Pre defined categories (each one collapsed into a regex alternation)
CandBio <- paste(c("Candidate Biography"), collapse = "|")
Economy <- paste(c("Campaign Finance", "Economy", "Jobs"), collapse = "|")
CivilRights <- paste(c("Justice", "Civil Rights"), collapse = "|")
Family <- paste(c("Education"), collapse = "|")
# Create new columns based on your categories
df$CandBio <- sapply(seq_along(string_list), function(x)
  any(stri_detect_regex(string_list[[x]], CandBio)) /
    length(string_list[[x]]))
df$Economy <- sapply(seq_along(string_list), function(x)
  any(stri_detect_regex(string_list[[x]], Economy)) /
    length(string_list[[x]]))
df$CivilRights <- sapply(seq_along(string_list), function(x)
  any(stri_detect_regex(string_list[[x]], CivilRights)) /
    length(string_list[[x]]))
df$Family <- sapply(seq_along(string_list), function(x)
  any(stri_detect_regex(string_list[[x]], Family)) /
    length(string_list[[x]]))
# Round the numeric columns for display
df %>%
  mutate_if(is.numeric, round, digits = 2)
                                    category CandBio Economy CivilRights Family
1                        Candidate Biography    1.00    0.00        0.00   0.00
2                        Candidate Biography    1.00    0.00        0.00   0.00
3                        Candidate Biography    1.00    0.00        0.00   0.00
4      Candidate Biography, Campaign Finance    0.50    0.50        0.00   0.00
5      Justice, Candidate Biography, Economy    0.33    0.33        0.33   0.00
6                  Candidate Biography, Jobs    0.50    0.50        0.00   0.00
7    Economy, Education, Candidate Biography    0.33    0.33        0.00   0.33
8 Economy, Civil Rights, Candidate Biography    0.33    0.33        0.33   0.00
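With around 120 raw categories feeding 10 variables, maintaining one regex per variable may get unwieldy. As an alternative sketch (not part of the answer above), the mapping could live in a single named vector that acts as a lookup table. The `lookup` vector and its entries are assumptions taken from the small example; the real mapping would have to be filled in by hand. The weighting rule is kept the same as above: 1 / (number of items in the row) for every variable that at least one item maps to.

```r
# Hypothetical lookup: raw category -> aggregated variable (extend to all ~120)
lookup <- c("Candidate Biography" = "CandBio",
            "Campaign Finance"    = "Economy",
            "Economy"             = "Economy",
            "Jobs"                = "Economy",
            "Justice"             = "CivilRights",
            "Civil Rights"        = "CivilRights",
            "Education"           = "Family")

df <- data.frame(category = c("Candidate Biography",
                              "Candidate Biography, Campaign Finance",
                              "Economy, Education, Candidate Biography"),
                 stringsAsFactors = FALSE)

# Split on commas and trim surrounding whitespace
items <- lapply(strsplit(df$category, ",", fixed = TRUE), trimws)

# Raw categories that are missing from the lookup (ideally none)
setdiff(unlist(items), names(lookup))

# One column per aggregated variable
for (g in unique(lookup)) {
  df[[g]] <- vapply(items, function(x) (g %in% lookup[x]) / length(x), numeric(1))
}
df
```

The `setdiff()` line is a cheap check for misspelled labels, which would otherwise silently receive a weight of 0 everywhere.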
Upvotes: 0
Reputation: 1098
If I understood your example correctly, the weights for the new variables depend on the number of categories in each row. In that case you can use a two-step approach: first create your new variables, and afterwards divide by the number of matched categories.
d <- data.frame(category = c("Candidate Biography", "Candidate Biography", "Candidate Biography",
                             "Candidate Biography, Campaign Finance",
                             "Justice, Candidate Biography, Economy", "Candidate Biography, Jobs",
                             "Economy, Education, Candidate Biography",
                             "Economy, Civil Rights, Candidate Biography"))
# create a list with all your new variables and their respective categories
categories <- list(
  CandBio = c("Candidate Biography"),
  Economy = c("Campaign Finance", "Economy", "Jobs"),
  CivilRights = c("Justice", "Civil Rights"),
  Family = c("Education")
)
# create the new variables
for (i in seq_along(categories)) {
  d[names(categories)[i]] <- grepl(paste0(categories[[i]], collapse = "|"), d[, "category"])
}
# divide by number of matched categories
d[, -1] <- d[, -1]/rowSums(d[, -1])
d
                                    category   CandBio   Economy CivilRights    Family
1                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
2                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
3                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
4      Candidate Biography, Campaign Finance 0.5000000 0.5000000   0.0000000 0.0000000
5      Justice, Candidate Biography, Economy 0.3333333 0.3333333   0.3333333 0.0000000
6                  Candidate Biography, Jobs 0.5000000 0.5000000   0.0000000 0.0000000
7    Economy, Education, Candidate Biography 0.3333333 0.3333333   0.0000000 0.3333333
8 Economy, Civil Rights, Candidate Biography 0.3333333 0.3333333   0.3333333 0.0000000
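Two edge cases may be worth checking with this approach. A row that matches none of the groups has a row sum of 0, so the division yields NaN. A row with several items from the same group is divided by the number of matched groups, not by the number of listed items. The sketch below uses two made-up rows to show both cases, with `pmax()` as one possible guard; "Health Care" is assumed here to belong to no group.

```r
d <- data.frame(category = c("Candidate Biography, Campaign Finance, Jobs",
                             "Health Care"),  # assumed to be in no group
                stringsAsFactors = FALSE)

categories <- list(
  CandBio = c("Candidate Biography"),
  Economy = c("Campaign Finance", "Economy", "Jobs")
)

# Flag which groups each row matches
for (nm in names(categories)) {
  d[[nm]] <- grepl(paste0(categories[[nm]], collapse = "|"), d$category)
}

# Number of matched groups per row; 0 for rows that match nothing
n_matched <- rowSums(d[, names(categories)])

# Divide by at least 1, so unmatched rows stay 0 instead of becoming NaN
d[, names(categories)] <- d[, names(categories)] / pmax(n_matched, 1)
d
```

The first row lists three items but matches only two groups, so it gets 0.5 and 0.5 here, whereas dividing by the number of listed items would give 1/3 each. Which of the two is wanted depends on the intended weighting scheme.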
Upvotes: 1