How can I exclude users in a survey with the same ID for calculations in R?

Question

My data.frame includes the results from a survey and looks like this:

date	id	age	gender	...
01-02	99	20	1	...
01-20	52	34	2	...
01-23	47	20	1	...
01-02	100	56	1	...
02-05	99	20	1	...
02-17	78	18	2	...
02-28	47	20	1	...

Users are allowed to take part in the survey each month, up to 10 times, so the personal data of some users appears more than once in the table.

Now to my problem: How can I get the mean (e.g. of age) over all users who took part in the survey? If I just use mean(df$age), those who took part more than once will obviously be overrepresented.

How can I get a table counting the users who took part once, twice, ... ten times? For example:

number of participations	number of users
1	2,047
2	23,127
3	50,000

I haven't found a solution for this, so I'm grateful for any help. Thanks in advance!

Ronak Shah · Accepted Answer

To get the average age of the participants, you can keep only one row per unique id and calculate the average from those rows.

In dplyr you can do this with distinct and summarise.

library(dplyr)

df %>%
  distinct(id, .keep_all = TRUE) %>%
  summarise(avg_age = mean(age))

#  avg_age
#1    29.6
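Note that distinct(id, .keep_all = TRUE) keeps the first row it finds for each id, so the result relies on age being the same on every row of a given user. A minimal base R sketch of the same idea, plus a quick consistency check, might look like this. The names first_rows, avg_age, ages_per_id and inconsistent_ids are only illustrative, and the sample data is copied from the data section of this answer.

```r
# Sample data (id and age columns only), copied from the "data" section
df <- data.frame(
  id  = c(99L, 52L, 47L, 100L, 99L, 78L, 47L),
  age = c(20L, 34L, 20L, 56L, 20L, 18L, 20L)
)

# Keep the first row of every id, then average the age over those rows
first_rows <- df[!duplicated(df$id), ]
avg_age <- mean(first_rows$age)
avg_age

# Number of distinct ages recorded per id; more than 1 means the user's
# age changed between participations and "first row" is a real choice
ages_per_id <- tapply(df$age, df$id, function(x) length(unique(x)))
inconsistent_ids <- names(ages_per_id)[ages_per_id > 1]
inconsistent_ids
```

If inconsistent_ids is not empty, you would have to decide which age to use for those users (first, last, or a per-user mean) before averaging.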

To count how many times each individual responded to the survey, you can use count:

df %>% count(id, name = 'count')

#   id count
#1  47     2
#2  52     1
#3  78     1
#4  99     2
#5 100     1
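The question also asks for the number of users per number of participations. One way to get there, sketched here under the assumption that each row is one participation, is to count the per-user counts a second time. The column names participations and n_users are chosen for illustration only.

```r
library(dplyr)

# Sample data (id and age columns only), copied from the "data" section
df <- data.frame(
  id  = c(99L, 52L, 47L, 100L, 99L, 78L, 47L),
  age = c(20L, 34L, 20L, 56L, 20L, 18L, 20L)
)

participation <- df %>%
  count(id, name = "participations") %>%     # rows per user
  count(participations, name = "n_users")    # users per participation count

participation

#  participations n_users
#1              1       3
#2              2       2
```

In base R, table(table(df$id)) should give the same tally without dplyr.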

data

It is easier to help if you provide data in a reproducible format:

df <- structure(list(date = c("01-02", "01-20", "01-23", "01-02", "02-05", 
"02-17", "02-28"), id = c(99L, 52L, 47L, 100L, 99L, 78L, 47L), 
    age = c(20L, 34L, 20L, 56L, 20L, 18L, 20L), gender = c(1L, 
    2L, 1L, 1L, 1L, 2L, 1L)), row.names = c(NA, -7L), class = "data.frame")
