Reputation: 97
I have a dataset of companies that have published articles in different blogs. The same company often appears under slightly different names, so I want to group the rows by similar name and find out in which blogs each company has published.
For each group of similar names, I want to keep the name and address of the first row and then, for every blog column, record a 1 if any row in the group has a 1 (published article) and a 0 otherwise.
I asked a similar question earlier about the first part (grouping by similar names), but I don't know how to do both steps at the same time.
This is a sample of my dataset:
   name           address           sports_blog nutrition_blog lifestyle_blog nature_blog
   <chr>          <chr>                   <dbl>          <dbl>          <dbl>       <dbl>
 1 Wellington     Adam Martin Sq. 1           1              0              0           0
 2 Wellingtoon    Adam Martin Sq. 1           0              1              0           0
 3 Wellington Co. Adam Martin Sq. 1           0              0              1           0
 4 Welinton       Adam Martin Sq. 1           0              0              0           1
 5 Cornell        Blue cross street           1              0              0           0
 6 Kornell        Blue cross street           0              1              0           0
 7 Coornell       Blue cross street           0              0              0           1
 8 Bleend         Aloha avenue                0              0              1           0
 9 Blind          Aloha avenue                0              0              0           1
10 Laguna         River street                1              0              0           0
11 Papito         Carnival street             1              0              0           0
12 Papeeto        Carnival street             0              0              1           0
As a result, I'm looking for something like this:
  name       address           sports_blog nutrition_blog lifestyle_blog nature_blog
  <chr>      <chr>                   <dbl>          <dbl>          <dbl>       <dbl>
1 Wellington Adam Martin Sq. 1           1              1              1           1
2 Cornell    Blue cross street           1              1              0           1
3 Bleend     Aloha avenue                0              0              1           1
4 Laguna     River street                1              0              0           0
5 Papito     Carnival street             1              0              1           0
Upvotes: 1
Views: 56
Reputation: 51592
You can simply include the address in your grouping. Using the similarGroups() function from the answer to your previous question (given by @RuiBarradas):
library(dplyr)

# similarGroups() comes from the earlier answer and is not defined here
df %>%
  group_by(name = name[similarGroups(name)], address) %>%
  summarise_all(sum)
which gives,
# A tibble: 5 x 6
# Groups:   grp [5]
  name       address         sports_blog nutrition_blog lifestyle_blog nature_blog
  <fct>      <fct>                 <int>          <int>          <int>       <int>
1 Bleend     Alohaavenue               0              0              1           1
2 Cornell    Bluecrossstreet           1              1              0           1
3 Laguna     Riverstreet               1              0              0           0
4 Papito     Carnivalstreet            1              0              1           0
5 Wellington AdamMartinSq1             1              1              1           1
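Since similarGroups() is not shown in this thread, here is a self-contained sketch that can be run as-is. It is a different, simpler technique than the fuzzy name matching above: it assumes the address alone identifies a company, which happens to hold for the sample data but may not hold for a real dataset. It also swaps sum for max, so each blog column stays a 0/1 flag even if a company published in the same blog under two name variants. The name result is only an illustrative variable, and across() needs dplyr 1.0 or later.

```r
library(dplyr)

# Sample data copied from the question.
df <- data.frame(
  name = c("Wellington", "Wellingtoon", "Wellington Co.", "Welinton",
           "Cornell", "Kornell", "Coornell", "Bleend", "Blind",
           "Laguna", "Papito", "Papeeto"),
  address = c(rep("Adam Martin Sq. 1", 4), rep("Blue cross street", 3),
              rep("Aloha avenue", 2), "River street",
              rep("Carnival street", 2)),
  sports_blog    = c(1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0),
  nutrition_blog = c(0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
  lifestyle_blog = c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1),
  nature_blog    = c(0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# ASSUMPTION: rows sharing an address belong to the same company.
result <- df %>%
  group_by(address) %>%
  summarise(
    name = first(name),                 # keep the first spelling in each group
    across(ends_with("_blog"), max),    # 1 if any row in the group has a 1
    .groups = "drop"
  ) %>%
  select(name, address, everything())   # restore the original column order

print(result)
```

The rows come back ordered by address rather than in the order of first appearance. If the address is not a reliable key in your data, keep the name-based grouping from the answer and only replace sum with max.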
Upvotes: 1