Reputation: 330

Calculating the mean value when the total is different per condition

I am working with R.

Here I share a sample of my data...

structure(list(column_a = c("1_1", "1_1", "1_2", "1_2", "1_2", 
"2_1", "2_2", "2_2", "3_1", "3_2"), column_b = c("kitchen", "tree", 
"hate", "kind", "table", "dog", "human", "car", "moon", "rage"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

   column_a column_b
1       1_1  kitchen
2       1_1     tree
3       1_2     hate
4       1_2     kind
5       1_2    table
6       2_1      dog
7       2_2    human
8       2_2      car
9       3_1     moon
10      3_2     rage

I need to calculate the mean value of the total words produced by condition (1_1, 1_2, etc.). The only problem that I have is that the conditions that finish with "_1" have a total number of 50 and the conditions that end with "_2" have a total number of 100.

So, since the condition "1_1" has two words produced (in the sample) I should calculate the mean by diving by 50. This is 2/50 = 0.04. However, when calculating the mean for the "1_2" condition I need to divide by 100. This is 3/100 = 0.03.

I need to create a column that has the mean value of the total words produced by condition, considering that some condition needs to be calculated by 50 and the other ones by a 100. How can I do this and still have the results in the same column?

Upvotes: 3

Answers (4)

ThomasIsCoding

Reputation: 102770

A base R option using ave + endsWith

transform(
  df,
  value = ave(column_a,
    column_a,
    FUN = function(x) length(x) / ifelse(endsWith(x, "_1"), 50, 100)
  )
)

gives

   column_a column_b value
1       1_1  kitchen  0.04
2       1_1     tree  0.04
3       1_2     hate  0.03
4       1_2     kind  0.03
5       1_2    table  0.03
6       2_1      dog  0.02
7       2_2    human  0.02
8       2_2      car  0.02
9       3_1     moon  0.02
10      3_2     rage  0.01

Upvotes: 2

akrun

Reputation: 887951

Using data.table

library(data.table)
setDT(df)[, n := .N, column_a][, .(mean_group = mean(n)), by = column_a]

Upvotes: 2

TarJae

Reputation: 79286

Just count by group and calculate the mean:

df %>% 
  group_by(column_a) %>% 
  add_count() %>% 
  summarise(mean_group = mean(n))

Output:

  column_a mean_group
  <chr>         <dbl>
1 1_1               2
2 1_2               3
3 2_1               1
4 2_2               2
5 3_1               1
6 3_2               1

Upvotes: 6

Ronak Shah

Reputation: 389325

You can use case_when to include a new column (divide) which is either 50 or 100 and for divide number of rows in the group (column_a) by it.

library(dplyr)

df %>%
  mutate(divide = case_when(endsWith(column_a, '_1') ~ 50, 
                            endsWith(column_a, '_2') ~ 100)) %>%
  group_by(column_a) %>%
  mutate(value = n()/divide) %>%
  ungroup

#   column_a column_b divide value
#   <chr>    <chr>     <dbl> <dbl>
# 1 1_1      kitchen      50  0.04
# 2 1_1      tree         50  0.04
# 3 1_2      hate        100  0.03
# 4 1_2      kind        100  0.03
# 5 1_2      table       100  0.03
# 6 2_1      dog          50  0.02
# 7 2_2      human       100  0.02
# 8 2_2      car         100  0.02
# 9 3_1      moon         50  0.02
#10 3_2      rage        100  0.01

Also similar with add_count -

library(dplyr)

df %>%
  mutate(divide = case_when(endsWith(column_a, '_1') ~ 50, 
                            endsWith(column_a, '_2') ~ 100)) %>%
  add_count(column_a) %>%
  mutate(value = n/divide) 
  ungroup

Upvotes: 4

Calculating the mean value when the total is different per condition

Answers (4)

Related Questions