Reputation: 187
I have a dataset with the 2016 primary election results. The dataset contains 8 columns: State, state_abbr, county, fips(which is the combined state and county id number), party, candidate, votes, and fraction votes. I want to create a new column called "result" that indicates a "won" or "loss" in each county for each candidate. I filtered the data using dplyr to the 2 democratic candidates, then used this code add the column:
Democrat$result <- ifelse(Democrat$fraction_votes > .5, "Win","Loss")
This is obviously not an accurate method, because the winner didn't always get 50% of the vote. How can I get R to compare the vote_fraction or vote totals for each county, and return a "win" or "loss?" Would the apply() family, for loop, or writing a function be the best way to create the new column?
state state_abbreviation county fips party candidate
1 Alabama AL Autauga 1001 Democrat Bernie Sanders
2 Alabama AL Autauga 1001 Democrat Hillary Clinton
3 Alabama AL Baldwin 1003 Democrat Bernie Sanders
4 Alabama AL Baldwin 1003 Democrat Hillary Clinton
5 Alabama AL Barbour 1005 Democrat Bernie Sanders
6 Alabama AL Barbour 1005 Democrat Hillary Clinton
votes fraction_votes
1 544 0.182
2 2387 0.800
3 2694 0.329
4 5290 0.647
5 222 0.078
6 2567 0.906
Upvotes: 3
Views: 582
Reputation: 38500
In base R, you can calculate a binary vector with ave
:
Democrat$winner <- ave(Democrat$fraction_votes, Democrat$fips, FUN=function(i) i == max(i))
which returns
Democrat
state state_abbreviation county fips party candidate votes fraction_votes winner
1 Alabama AL Autauga 1001 Democrat Bernie 544 0.182 0
2 Alabama AL Autauga 1001 Democrat Hillary 2387 0.800 1
3 Alabama AL Baldwin 1003 Democrat Bernie 2694 0.329 0
4 Alabama AL Baldwin 1003 Democrat Hillary 5290 0.647 1
5 Alabama AL Barbour 1005 Democrat Bernie 222 0.078 0
6 Alabama AL Barbour 1005 Democrat Hillary 2567 0.906 1
which could be converted to logical by wrapping the ave
in as.logical
if desired.
This is also quite straightforward in data.table
. Assuming that fips is the unique state-county ID:
library(data.table)
# convert to data.table
setDT(Democrat)
# get logical vector that proclaims winner if vote fraction is maximum
Democrat[, winner := fraction_votes == max(fraction_votes), by=fips]
which returns
Democrat
state state_abbreviation county fips party candidate votes fraction_votes winner
1: Alabama AL Autauga 1001 Democrat Bernie 544 0.182 FALSE
2: Alabama AL Autauga 1001 Democrat Hillary 2387 0.800 TRUE
3: Alabama AL Baldwin 1003 Democrat Bernie 2694 0.329 FALSE
4: Alabama AL Baldwin 1003 Democrat Hillary 5290 0.647 TRUE
5: Alabama AL Barbour 1005 Democrat Bernie 222 0.078 FALSE
6: Alabama AL Barbour 1005 Democrat Hillary 2567 0.906 TRUE
data
Democrat <-
structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Alabama", class = "factor"),
state_abbreviation = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"),
county = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("Autauga",
"Baldwin", "Barbour"), class = "factor"), fips = c(1001L,
1001L, 1003L, 1003L, 1005L, 1005L), party = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "Democrat", class = "factor"),
candidate = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Bernie",
"Hillary"), class = "factor"), votes = c(544L, 2387L, 2694L,
5290L, 222L, 2567L), fraction_votes = c(0.182, 0.8, 0.329,
0.647, 0.078, 0.906)), .Names = c("state", "state_abbreviation",
"county", "fips", "party", "candidate", "votes", "fraction_votes"
), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
Upvotes: 1
Reputation: 2644
I would first use summarise
function from dplyr
package to find the maximum number of votes any candidate received in a given county, then add the column with county maximum to the original dataset, then calculate the result.
# create a sample dataset akin to the question setup
df <- data.frame(abrev = rep("AL", 6), county = c("Autuga", "Autuga", "Baldwin", "Baldwin",
"Barbour", "Barbour"),
party = rep("Democrat", 6),
candidate = rep(c("Bernie", "Hillary"), 3),
fraction_votes = c(0.18, 0.8, 0.32, 0.64, 0.07, 0.9))
# load a dplyr library
library(dplyr)
# calculate what was the maximum ammount of votes candidate received in a given county
# take a df dataset
winners <- df %>%
# group it by a county
group_by(county) %>%
# for each county, calculate maximum of votes
summarise(score = max(fraction_votes))
# join the original dataset and the dataset with county maximumus
# join them by county column
df <- left_join(df, winners, by = c("county"))
# calculate the result column
df$result <- ifelse(df$fraction_votes == df$score, "Win", "Loss")
If there are different counties with same name, you would have to adjust the grouping and joining part, but the logic should be the same
Upvotes: 1