farina27
farina27

Reputation: 33

R: Count unique rows for each unique name per year

This is my first question on stackoverflow. I searched for similar questions but I didn't find an answer.

I know that the question in the title isn't clear but I hope you are going to understand what I want as output.

I have a dataframe that looks like this:

   ID      Name Year
1   1      Anas 2018
2   1      Carl 2018
3   1 Catherine 2018
4   2      Anas 2018
5   2      Carl 2018
6   3 Catherine 2018
7   3    Julien 2018
8   4      Raul 2018
9   4     Ahmed 2018
10  4  Laurence 2018
11  4      Carl 2018
12  5      Anas 2019
13  5   Georges 2019
14  5     Arman 2019
15  6      Anas 2019
16  6    Pietro 2019
17  7    Pietro 2019
18  8     Diego 2019

if the names in the column "Name" have the same ID, it means that are collaborators in a project.

I want to add a column with the number of UNIQUE collaborators per year for each name (by including each name in the count of his collborators)

The output should look like this: (I added the last column to explain how to count-Not needed)

ID      Name Year Unique_Coll                                      explication
1   1      Anas 2018           3                          (Anas, Carl, Catherine)
2   1      Carl 2018           6  (Carle, Anas, Catherine, Laurence, Ahmed, Raul)
3   1 Catherine 2018           4                  (Catherine, Carl, Anas, Julien)
4   2      Anas 2018           3                          (Anas, Carl, Catherine)
5   2      Carl 2018           6  (Carle, Anas, Catherine, Laurence, Ahmed, Raul)
6   3 Catherine 2018           4                  (Catherine, Carl, Anas, Julien)
7   3    Julien 2018           2                              (Julien, Catherine)
8   4      Raul 2018           4                    (Raul, Ahmed, Laurence, Carl)
9   4     Ahmed 2018           4                    (Ahmed, Raul, Laurence, Carl)
10  4  Laurence 2018           4                    (Laurence, Raul, Ahmed, Carl)
11  4      Carl 2018           6  (Carle, Anas, Catherine, Laurence, Ahmed, Raul)
12  5      Anas 2019           4                   (Anas, Georges, Arman, Pietro)
13  5   Georges 2019           3                           (Georges, Anas, Arman)
14  5     Arman 2019           3                            (Arman Anas, Georges)
15  6      Anas 2019           4                   (Anas, Georges, Arman, Pietro)
16  6    Pietro 2019           2                                   (Pietro, Anas)
17  7    Pietro 2019           2                                   (Pietro, Anas)
18  8     Diego 2019           1                                        (Diego)  

Thank you

Upvotes: 3

Views: 134

Answers (3)

Alexlok
Alexlok

Reputation: 3134

I have a solution using joins.

library(tidyverse)

# read data
dta <- tribble(~ID, ~Name, ~Year,
               1, "Anas", 2018,
               1, "Carl", 2018,
               1, "Catherine", 2018,
               2, "Anas", 2018,
               2, "Carl", 2018,
               3, "Catherine", 2018,
               3, "Julien", 2018,
               4, "Raul", 2018,
               4, "Ahmed", 2018,
               4, "Laurence", 2018,
               4, "Carl", 2018,
               5, "Anas", 2019,
               5, "Georges", 2019,
               5, "Arman", 2019,
               6, "Anas", 2019,
               6, "Pietro", 2019,
               7, "Pietro", 2019,
               8, "Diego", 2019)

nb_collabs <- dta %>%
  left_join(dta, by = c("ID", "Year")) %>%
  select(-ID) %>%
  group_by(Name.x, Year) %>%
  nest(collaborators = Name.y) %>%
  mutate(unique_collaborators = map(collaborators, distinct),
         Unique_Coll = map_int(unique_collaborators, nrow)) %>%
  select(-collaborators, -unique_collaborators)

left_join(dta, nb_collabs, by = c("Name"="Name.x", "Year"))
# A tibble: 18 x 4
#      ID Name       Year Unique_Coll
#   <dbl> <chr>     <dbl>       <int>
# 1     1 Anas       2018           3
# 2     1 Carl       2018           6
# 3     1 Catherine  2018           4
# 4     2 Anas       2018           3
# 5     2 Carl       2018           6
# 6     3 Catherine  2018           4
# 7     3 Julien     2018           2
# 8     4 Raul       2018           4
# 9     4 Ahmed      2018           4
#10     4 Laurence   2018           4
#11     4 Carl       2018           6
#12     5 Anas       2019           4
#13     5 Georges    2019           3
#14     5 Arman      2019           3
#15     6 Anas       2019           4
#16     6 Pietro     2019           2
#17     7 Pietro     2019           2
#18     8 Diego      2019           1

So the first step is to join the data with itself. The point is to have the name in "Name.x", and a separate row for each collaborator as "Name.y". Then we can nest the collaborator names, so that we get a data frame with one row for each Name, with a nested data frame with the collaborators, so we just need to remove the duplicates and count the number of persons.

In nb_collabs we have a table with each person and the number of collaborators, we can simply join it back with the original data frame to get the desired format.

Upvotes: 0

Lamia
Lamia

Reputation: 3875

You could construct a variable that would be a list of names and count the number of unique names in the following way:

library(dplyr)
df = df %>% 
group_by(ID) %>% 
mutate(group = list(Name)) %>% 
group_by(Year,Name) %>% 
mutate(n = n_distinct(unlist(list(group)))) %>%
select(-group)

# A tibble: 18 x 4
# Groups:   Year, Name [12]
      ID Name       Year     n
   <int> <chr>     <int> <int>
 1     1 Anas       2018     3
 2     1 Carl       2018     6
 3     1 Catherine  2018     4
 4     2 Anas       2018     3
 5     2 Carl       2018     6
 6     3 Catherine  2018     4
 7     3 Julien     2018     2
 8     4 Raul       2018     4
 9     4 Ahmed      2018     4
10     4 Laurence   2018     4
11     4 Carl       2018     6
12     5 Anas       2019     4
13     5 Georges    2019     3
14     5 Arman      2019     3
15     6 Anas       2019     4
16     6 Pietro     2019     2
17     7 Pietro     2019     2
18     8 Diego      2019     1

Upvotes: 2

Bas
Bas

Reputation: 4658

The following solution uses dplyr to first join all collaborators to every Name, creating a column Name_collab (note that this expands the data frame and could blow it up if it were large). Then, we count the distinct Name_collab for every Name, Year combination and get rid of duplicates.

library(dplyr)

df %>% 
  left_join(df, by = c("ID", "Year"), suffix = c("", "_collab")) %>% 
  group_by(Name, Year) %>% 
  mutate(Unique_Coll = n_distinct(Name_collab)) %>% 
  ungroup() %>% 
  distinct(ID, Name, Year, Unique_Coll)

which gives

# A tibble: 18 x 4
      ID Name       Year Unique_Coll
   <int> <fct>     <int>       <int>
 1     1 Anas       2018           3
 2     1 Carl       2018           6
 3     1 Catherine  2018           4
 4     2 Anas       2018           3
 5     2 Carl       2018           6
 6     3 Catherine  2018           4
 7     3 Julien     2018           2
 8     4 Raul       2018           4
 9     4 Ahmed      2018           4
10     4 Laurence   2018           4
11     4 Carl       2018           6
12     5 Anas       2019           4
13     5 Georges    2019           3
14     5 Arman      2019           3
15     6 Anas       2019           4
16     6 Pietro     2019           2
17     7 Pietro     2019           2
18     8 Diego      2019           1

Input:

df <- read.table(text="ID      Name Year
1   1      Anas 2018
2   1      Carl 2018
3   1 Catherine 2018
4   2      Anas 2018
5   2      Carl 2018
6   3 Catherine 2018
7   3    Julien 2018
8   4      Raul 2018
9   4     Ahmed 2018
10  4  Laurence 2018
11  4      Carl 2018
12  5      Anas 2019
13  5   Georges 2019
14  5     Arman 2019
15  6      Anas 2019
16  6    Pietro 2019
17  7    Pietro 2019
18  8     Diego 2019")

Upvotes: 2

Related Questions