Andrew Lastrapes
Andrew Lastrapes

Reputation: 187

Adding a new variable by comparing existing variables in a data frame in r

I have a dataset with the 2016 primary election results. The dataset contains 8 columns: State, state_abbr, county, fips(which is the combined state and county id number), party, candidate, votes, and fraction votes. I want to create a new column called "result" that indicates a "won" or "loss" in each county for each candidate. I filtered the data using dplyr to the 2 democratic candidates, then used this code add the column:

 Democrat$result <- ifelse(Democrat$fraction_votes > .5, "Win","Loss")

This is obviously not an accurate method, because the winner didn't always get 50% of the vote. How can I get R to compare the vote_fraction or vote totals for each county, and return a "win" or "loss?" Would the apply() family, for loop, or writing a function be the best way to create the new column?

  state state_abbreviation  county fips    party       candidate
   1    Alabama  AL         Autauga 1001 Democrat  Bernie Sanders
   2    Alabama  AL         Autauga 1001 Democrat Hillary Clinton
   3 Alabama    AL          Baldwin 1003 Democrat  Bernie Sanders
   4 Alabama   AL           Baldwin 1003 Democrat Hillary Clinton
   5 Alabama   AL           Barbour 1005 Democrat  Bernie Sanders
   6 Alabama   AL            Barbour 1005 Democrat Hillary Clinton
    votes fraction_votes
    1   544          0.182
    2  2387          0.800
     3  2694          0.329
     4  5290          0.647
    5   222          0.078
    6  2567          0.906

Upvotes: 3

Views: 582

Answers (2)

lmo
lmo

Reputation: 38500

In base R, you can calculate a binary vector with ave:

Democrat$winner <- ave(Democrat$fraction_votes, Democrat$fips, FUN=function(i) i == max(i))

which returns

Democrat
    state state_abbreviation  county fips    party candidate votes fraction_votes winner
1 Alabama                 AL Autauga 1001 Democrat    Bernie   544          0.182      0
2 Alabama                 AL Autauga 1001 Democrat   Hillary  2387          0.800      1
3 Alabama                 AL Baldwin 1003 Democrat    Bernie  2694          0.329      0
4 Alabama                 AL Baldwin 1003 Democrat   Hillary  5290          0.647      1
5 Alabama                 AL Barbour 1005 Democrat    Bernie   222          0.078      0
6 Alabama                 AL Barbour 1005 Democrat   Hillary  2567          0.906      1

which could be converted to logical by wrapping the ave in as.logical if desired.


This is also quite straightforward in data.table. Assuming that fips is the unique state-county ID:

library(data.table)
# convert to data.table
setDT(Democrat)

# get logical vector that proclaims winner if vote fraction is maximum
Democrat[, winner := fraction_votes == max(fraction_votes), by=fips]

which returns

Democrat
     state state_abbreviation  county fips    party candidate votes fraction_votes winner
1: Alabama                 AL Autauga 1001 Democrat    Bernie   544          0.182  FALSE
2: Alabama                 AL Autauga 1001 Democrat   Hillary  2387          0.800   TRUE
3: Alabama                 AL Baldwin 1003 Democrat    Bernie  2694          0.329  FALSE
4: Alabama                 AL Baldwin 1003 Democrat   Hillary  5290          0.647   TRUE
5: Alabama                 AL Barbour 1005 Democrat    Bernie   222          0.078  FALSE
6: Alabama                 AL Barbour 1005 Democrat   Hillary  2567          0.906   TRUE

data

Democrat <-
structure(list(state = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Alabama", class = "factor"), 
    state_abbreviation = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "AL", class = "factor"), 
    county = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("Autauga", 
    "Baldwin", "Barbour"), class = "factor"), fips = c(1001L, 
    1001L, 1003L, 1003L, 1005L, 1005L), party = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = "Democrat", class = "factor"), 
    candidate = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Bernie", 
    "Hillary"), class = "factor"), votes = c(544L, 2387L, 2694L, 
    5290L, 222L, 2567L), fraction_votes = c(0.182, 0.8, 0.329, 
    0.647, 0.078, 0.906)), .Names = c("state", "state_abbreviation", 
"county", "fips", "party", "candidate", "votes", "fraction_votes"
), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

Upvotes: 1

ira
ira

Reputation: 2644

I would first use summarise function from dplyr package to find the maximum number of votes any candidate received in a given county, then add the column with county maximum to the original dataset, then calculate the result.

# create a sample dataset akin to the question setup
df <- data.frame(abrev = rep("AL", 6), county = c("Autuga", "Autuga", "Baldwin", "Baldwin",
                                                  "Barbour", "Barbour"),
                 party = rep("Democrat", 6), 
                 candidate = rep(c("Bernie", "Hillary"), 3),
                 fraction_votes = c(0.18, 0.8, 0.32, 0.64, 0.07, 0.9))

# load a dplyr library
library(dplyr)

# calculate what was the maximum ammount of votes candidate received in a given county

# take a df dataset
winners <- df %>%
        # group it by a county
        group_by(county) %>%
        # for each county, calculate maximum of votes
        summarise(score = max(fraction_votes))

# join the original dataset and the dataset with county maximumus
# join them by county column
df <- left_join(df, winners, by = c("county"))

# calculate the result column
df$result <- ifelse(df$fraction_votes == df$score, "Win", "Loss")

If there are different counties with same name, you would have to adjust the grouping and joining part, but the logic should be the same

Upvotes: 1

Related Questions