Reputation: 33
This is my first question on stackoverflow. I searched for similar questions but I didn't find an answer.
I know that the question in the title isn't clear but I hope you are going to understand what I want as output.
I have a dataframe that looks like this:
ID Name Year
1 1 Anas 2018
2 1 Carl 2018
3 1 Catherine 2018
4 2 Anas 2018
5 2 Carl 2018
6 3 Catherine 2018
7 3 Julien 2018
8 4 Raul 2018
9 4 Ahmed 2018
10 4 Laurence 2018
11 4 Carl 2018
12 5 Anas 2019
13 5 Georges 2019
14 5 Arman 2019
15 6 Anas 2019
16 6 Pietro 2019
17 7 Pietro 2019
18 8 Diego 2019
if the names in the column "Name" have the same ID, it means that are collaborators in a project.
I want to add a column with the number of UNIQUE collaborators per year for each name (by including each name in the count of his collborators)
The output should look like this: (I added the last column to explain how to count-Not needed)
ID Name Year Unique_Coll explication
1 1 Anas 2018 3 (Anas, Carl, Catherine)
2 1 Carl 2018 6 (Carle, Anas, Catherine, Laurence, Ahmed, Raul)
3 1 Catherine 2018 4 (Catherine, Carl, Anas, Julien)
4 2 Anas 2018 3 (Anas, Carl, Catherine)
5 2 Carl 2018 6 (Carle, Anas, Catherine, Laurence, Ahmed, Raul)
6 3 Catherine 2018 4 (Catherine, Carl, Anas, Julien)
7 3 Julien 2018 2 (Julien, Catherine)
8 4 Raul 2018 4 (Raul, Ahmed, Laurence, Carl)
9 4 Ahmed 2018 4 (Ahmed, Raul, Laurence, Carl)
10 4 Laurence 2018 4 (Laurence, Raul, Ahmed, Carl)
11 4 Carl 2018 6 (Carle, Anas, Catherine, Laurence, Ahmed, Raul)
12 5 Anas 2019 4 (Anas, Georges, Arman, Pietro)
13 5 Georges 2019 3 (Georges, Anas, Arman)
14 5 Arman 2019 3 (Arman Anas, Georges)
15 6 Anas 2019 4 (Anas, Georges, Arman, Pietro)
16 6 Pietro 2019 2 (Pietro, Anas)
17 7 Pietro 2019 2 (Pietro, Anas)
18 8 Diego 2019 1 (Diego)
Thank you
Upvotes: 3
Views: 134
Reputation: 3134
I have a solution using joins.
library(tidyverse)
# read data
dta <- tribble(~ID, ~Name, ~Year,
1, "Anas", 2018,
1, "Carl", 2018,
1, "Catherine", 2018,
2, "Anas", 2018,
2, "Carl", 2018,
3, "Catherine", 2018,
3, "Julien", 2018,
4, "Raul", 2018,
4, "Ahmed", 2018,
4, "Laurence", 2018,
4, "Carl", 2018,
5, "Anas", 2019,
5, "Georges", 2019,
5, "Arman", 2019,
6, "Anas", 2019,
6, "Pietro", 2019,
7, "Pietro", 2019,
8, "Diego", 2019)
nb_collabs <- dta %>%
left_join(dta, by = c("ID", "Year")) %>%
select(-ID) %>%
group_by(Name.x, Year) %>%
nest(collaborators = Name.y) %>%
mutate(unique_collaborators = map(collaborators, distinct),
Unique_Coll = map_int(unique_collaborators, nrow)) %>%
select(-collaborators, -unique_collaborators)
left_join(dta, nb_collabs, by = c("Name"="Name.x", "Year"))
# A tibble: 18 x 4
# ID Name Year Unique_Coll
# <dbl> <chr> <dbl> <int>
# 1 1 Anas 2018 3
# 2 1 Carl 2018 6
# 3 1 Catherine 2018 4
# 4 2 Anas 2018 3
# 5 2 Carl 2018 6
# 6 3 Catherine 2018 4
# 7 3 Julien 2018 2
# 8 4 Raul 2018 4
# 9 4 Ahmed 2018 4
#10 4 Laurence 2018 4
#11 4 Carl 2018 6
#12 5 Anas 2019 4
#13 5 Georges 2019 3
#14 5 Arman 2019 3
#15 6 Anas 2019 4
#16 6 Pietro 2019 2
#17 7 Pietro 2019 2
#18 8 Diego 2019 1
So the first step is to join the data with itself. The point is to have the name in "Name.x", and a separate row for each collaborator as "Name.y". Then we can nest the collaborator names, so that we get a data frame with one row for each Name, with a nested data frame with the collaborators, so we just need to remove the duplicates and count the number of persons.
In nb_collabs
we have a table with each person and the number of collaborators, we can simply join
it back with the original data frame to get the desired format.
Upvotes: 0
Reputation: 3875
You could construct a variable that would be a list of names and count the number of unique names in the following way:
library(dplyr)
df = df %>%
group_by(ID) %>%
mutate(group = list(Name)) %>%
group_by(Year,Name) %>%
mutate(n = n_distinct(unlist(list(group)))) %>%
select(-group)
# A tibble: 18 x 4
# Groups: Year, Name [12]
ID Name Year n
<int> <chr> <int> <int>
1 1 Anas 2018 3
2 1 Carl 2018 6
3 1 Catherine 2018 4
4 2 Anas 2018 3
5 2 Carl 2018 6
6 3 Catherine 2018 4
7 3 Julien 2018 2
8 4 Raul 2018 4
9 4 Ahmed 2018 4
10 4 Laurence 2018 4
11 4 Carl 2018 6
12 5 Anas 2019 4
13 5 Georges 2019 3
14 5 Arman 2019 3
15 6 Anas 2019 4
16 6 Pietro 2019 2
17 7 Pietro 2019 2
18 8 Diego 2019 1
Upvotes: 2
Reputation: 4658
The following solution uses dplyr
to first join all collaborators to every Name
, creating a column Name_collab
(note that this expands the data frame and could blow it up if it were large). Then, we count the distinct Name_collab
for every Name
, Year
combination and get rid of duplicates.
library(dplyr)
df %>%
left_join(df, by = c("ID", "Year"), suffix = c("", "_collab")) %>%
group_by(Name, Year) %>%
mutate(Unique_Coll = n_distinct(Name_collab)) %>%
ungroup() %>%
distinct(ID, Name, Year, Unique_Coll)
which gives
# A tibble: 18 x 4
ID Name Year Unique_Coll
<int> <fct> <int> <int>
1 1 Anas 2018 3
2 1 Carl 2018 6
3 1 Catherine 2018 4
4 2 Anas 2018 3
5 2 Carl 2018 6
6 3 Catherine 2018 4
7 3 Julien 2018 2
8 4 Raul 2018 4
9 4 Ahmed 2018 4
10 4 Laurence 2018 4
11 4 Carl 2018 6
12 5 Anas 2019 4
13 5 Georges 2019 3
14 5 Arman 2019 3
15 6 Anas 2019 4
16 6 Pietro 2019 2
17 7 Pietro 2019 2
18 8 Diego 2019 1
Input:
df <- read.table(text="ID Name Year
1 1 Anas 2018
2 1 Carl 2018
3 1 Catherine 2018
4 2 Anas 2018
5 2 Carl 2018
6 3 Catherine 2018
7 3 Julien 2018
8 4 Raul 2018
9 4 Ahmed 2018
10 4 Laurence 2018
11 4 Carl 2018
12 5 Anas 2019
13 5 Georges 2019
14 5 Arman 2019
15 6 Anas 2019
16 6 Pietro 2019
17 7 Pietro 2019
18 8 Diego 2019")
Upvotes: 2