MeC

Reputation: 463

call variable that has been grouped by

Some sample data:

 df <- data.frame(lang = rep(c("A", "B", "C"), 3), 
                  answer = rep(c("1", "2", "3"), each=3))

I am getting an error when I try to call a variable that I recently grouped by:

 df2 <- df %>%
   Total = count(lang) %>%  # count is short hand for tally + group_by()
   filter(answer=='2') %>% 
   mutate(prop = NROW(answer)/NROW(Total)) 

 Error in group_vars(x) : object 'lang' not found

I would like a new column on my dataframe that says the proportion of the answer '2' to total observations in each level of lang. So how many times does '2' occur in 'A' in proportion to the total number of observations in 'A'?

Upvotes: 1

Views: 144

Answers (2)

hedgedandlevered

Reputation: 2394

Alternate solution with data.table

Personally, I prefer data.table to plain data frames. Here is the implementation with that approach. Admittedly it looks a bit more cryptic than the dplyr solution: the syntax for something like this is more involved, but getting used to it gives you a whole bag of tricks, and for simple queries the syntax is actually cleaner.

The error occurs because you end up using "lang" as if it were a standalone variable, when it is the name of a column.

To get the requested values (0.3333 for each group):

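library(data.table)
df <- as.data.table(df)  # convert the sample data frame to a data.table
# share of rows with answer "2" within each lang group
df[, nrow(.SD[answer == "2"]) / nrow(.SD), by = "lang"]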

   lang        V1
1:    A 0.3333333
2:    B 0.3333333
3:    C 0.3333333

(The special variable .SD holds the subset of the data for the current group, so the expression is evaluated once per value of the by column.)
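A slightly shorter variant of the same idea, offered as a sketch rather than the author's method: since the mean of a logical vector is the share of TRUE values, the proportion can be computed without .SD and given a name instead of the default V1. The names dt and res are only illustrative.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(lang = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))

# mean() of a logical vector is the fraction of TRUE values,
# so this is the share of rows with answer "2" in each lang group
res <- dt[, .(prop = mean(answer == "2")), by = lang]
```

Wrapping the expression in .( ) is what lets the result column be called prop.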

Upvotes: 0

GenesRus

Reputation: 1057

Here's a solution that does what you want:

df %>% 
  group_by(lang) %>% 
  summarize(
    prop = length(lang[answer==2])/n()
  )

Here, we group by the variable (or variables) that define the groups you want a proportion for. summarize then takes the length of the vector of lang values where answer equals 2 and divides it by n(), the number of rows in the group. If you want to keep the answer column alongside prop, just change summarize to mutate.
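As a minimal sketch of the mutate variant (assuming dplyr is loaded and df is the sample data from the question; df_with_prop is just an illustrative name), every row keeps its answer value and gets the proportion for its group:

```r
library(dplyr)

# Sample data from the question
df <- data.frame(lang = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))

# mutate() keeps all 9 rows; prop is repeated within each lang group
df_with_prop <- df %>%
  group_by(lang) %>%
  mutate(prop = sum(answer == "2") / n()) %>%
  ungroup()  # drop the grouping so later steps work on the whole table
```

Here sum(answer == "2") counts the matching rows in the group, which is equivalent to the length(...) expression used with summarize.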

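You were getting the error about lang not being found because count is a verb like mutate: it takes the data frame as its first argument (supplied by the pipe), so it cannot be assigned to a name in the middle of a pipeline. In Total = count(lang), lang is passed where the data frame should be, and no object of that name exists outside df. The correct call is: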

df %>% 
  count(lang, name = "Total")

You could achieve the same thing by adapting your code, but you should use add_count (so your answer column is preserved) or mutate(Total = n()) on data grouped by lang. Still, group_by with summarize, as in the first snippet, was designed for problems like this and is definitely worth spending some time learning.

df %>% 
  add_count(lang, name = "Total") %>% 
  filter(answer == 2) %>% 
  add_count(lang, name = "Twos") %>% 
  distinct(lang, .keep_all = TRUE) %>% 
  mutate(prop = Twos/Total) %>% 
  select(lang, prop)
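The mutate(Total = n()) alternative mentioned above could look roughly like this. It is a sketch under the assumption that the data is grouped by lang first, since without group_by, n() would return the row count of the whole table; props is an illustrative name.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(lang = rep(c("A", "B", "C"), 3),
                 answer = rep(c("1", "2", "3"), each = 3))

props <- df %>%
  group_by(lang) %>%
  mutate(Total = n()) %>%                 # group size, stored on every row
  filter(answer == "2") %>%               # keep only the rows of interest
  summarize(prop = n() / first(Total))    # rows left in group / group size
```

Note that, as with the add_count version, a lang level with no rows where answer is "2" disappears after filter instead of showing a proportion of 0; the group_by plus summarize solution does not have that limitation.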

Upvotes: 3
