Reputation: 391
I have a data frame looks like below:
name workplace year note1 note2 job
Ben Alpha 2011 xxxx xx director
Ben Beta 2011 xx xxx director
Ben Beta 2011 xxx xxxx vice president
Wendy Sigma 2011 xxxx x director
Wendy Sigma 2011 xx xx vice president
Wendy Sigma 2011 x xxx CEO
Alice Beta 2011 xxx x staff
Alice Beta 2012 xx xx deputy director
I want to identify and merge the duplicated rows based on columns "name", "workplace" and "year" (don't consider columns "note1" and "note2). And the information in column "job" will be merged. The output should look like below. Note that the information in "job" is merged based on matching "name", "workplace" and "year". Information in "note1" and "note2" don't need to merge and should be the "note1" and "note2" information in the first row of the matching rows:
name workplace year note1 note2 job.1 job.2 job.3
Ben Alpha 2011 xxxx xx director NA NA
Ben Beta 2011 xx xxx director vice president
Wendy Sigma 2011 xxxx x director vice president CEO
Alice Beta 2011 xxx x staff NA
Alice Beta 2012 xx xx secretary NA NA
Upvotes: 0
Views: 58
Reputation: 26218
Another approach without using pivot
approach. Here you can use first
or last
or whatever aggregate functions for notes
fields as desired. You can make use of appropriate arguments to mute all warnings
df %>% group_by(name, workplace, year) %>%
summarise(note1 = last(note1),
note2 = last(note2),
job = toString(job), .groups = 'drop') %>%
separate(job, into = paste0('Job', seq_len(max(1 + str_count(.$job, ',')))),
sep = ', ',
extra = "drop",
fill = 'right')
# A tibble: 5 x 8
name workplace year note1 note2 Job1 Job2 Job3
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
1 Alice Beta 2011 xxx x staff NA NA
2 Alice Beta 2012 xx xx deputy director NA NA
3 Ben Alpha 2011 xxxx xx director NA NA
4 Ben Beta 2011 xxx xxxx director vice president NA
5 Wendy Sigma 2011 x xxx director vice president CEO
Upvotes: 1
Reputation: 66500
Here's an approach using dplyr
and tidyr
. First, I remove the notes fields to deal with those separately. Then I assign a row number to each job row within a name/workplace/year group. Then spread into columns based on those jobs. Then finally, add the note from the first row of each name/workplace/year.
library(tidyr); library(dplyr)
my_data %>%
select(-note1, -note2) %>%
group_by(name, workplace, year) %>%
mutate(job_num = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = job_num, values_from = job, names_prefix = "job.") %>%
left_join(my_data %>% distinct(name, workplace, year, .keep_all = TRUE))
Result:
# A tibble: 5 x 9
name workplace year job.1 job.2 job.3 note1 note2 job
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 Ben Alpha 2011 director NA NA xxxx xx director
2 Ben Beta 2011 director vice president NA xx xxx director
3 Wendy Sigma 2011 director vice president CEO xxxx x director
4 Alice Beta 2011 staff NA NA xxx x staff
5 Alice Beta 2012 deputy director NA NA xx xx deputy director
Upvotes: 1