RandomThinker
RandomThinker

Reputation: 391

dataframe partial merge in R

I have a data frame looks like below:

name    workplace   year   note1  note2  job
Ben     Alpha       2011   xxxx   xx     director
Ben     Beta        2011   xx     xxx    director
Ben     Beta        2011   xxx    xxxx   vice president
Wendy   Sigma       2011   xxxx   x      director
Wendy   Sigma       2011   xx     xx     vice president
Wendy   Sigma       2011   x      xxx    CEO
Alice   Beta        2011   xxx    x      staff
Alice   Beta        2012   xx     xx     deputy director

I want to identify and merge the duplicated rows based on columns "name", "workplace" and "year" (don't consider columns "note1" and "note2). And the information in column "job" will be merged. The output should look like below. Note that the information in "job" is merged based on matching "name", "workplace" and "year". Information in "note1" and "note2" don't need to merge and should be the "note1" and "note2" information in the first row of the matching rows:

name    workplace   year   note1  note2  job.1      job.2           job.3
Ben     Alpha       2011   xxxx   xx     director   NA              NA
Ben     Beta        2011   xx     xxx    director   vice president
Wendy   Sigma       2011   xxxx   x      director   vice president  CEO
Alice   Beta        2011   xxx    x      staff      NA
Alice   Beta        2012   xx     xx     secretary  NA              NA

Upvotes: 0

Views: 58

Answers (2)

AnilGoyal
AnilGoyal

Reputation: 26218

Another approach without using pivot approach. Here you can use first or last or whatever aggregate functions for notes fields as desired. You can make use of appropriate arguments to mute all warnings

df %>% group_by(name, workplace, year) %>%
  summarise(note1 = last(note1),
            note2 = last(note2),
            job = toString(job), .groups = 'drop') %>%
  separate(job, into = paste0('Job', seq_len(max(1 + str_count(.$job, ',')))), 
           sep = ', ',
           extra = "drop", 
           fill = 'right')

# A tibble: 5 x 8
  name  workplace  year note1 note2 Job1            Job2           Job3 
  <chr> <chr>     <int> <chr> <chr> <chr>           <chr>          <chr>
1 Alice Beta       2011 xxx   x     staff           NA             NA   
2 Alice Beta       2012 xx    xx    deputy director NA             NA   
3 Ben   Alpha      2011 xxxx  xx    director        NA             NA   
4 Ben   Beta       2011 xxx   xxxx  director        vice president NA   
5 Wendy Sigma      2011 x     xxx   director        vice president CEO

Upvotes: 1

Jon Spring
Jon Spring

Reputation: 66500

Here's an approach using dplyr and tidyr. First, I remove the notes fields to deal with those separately. Then I assign a row number to each job row within a name/workplace/year group. Then spread into columns based on those jobs. Then finally, add the note from the first row of each name/workplace/year.

library(tidyr); library(dplyr)
my_data %>%
  select(-note1, -note2) %>%
  
  group_by(name, workplace, year) %>%
  mutate(job_num = row_number()) %>%
  ungroup() %>%
  
  pivot_wider(names_from = job_num, values_from = job, names_prefix = "job.") %>%
  
  left_join(my_data %>% distinct(name, workplace, year, .keep_all = TRUE))

Result:

# A tibble: 5 x 9
  name  workplace  year job.1           job.2          job.3 note1 note2 job            
  <chr> <chr>     <int> <chr>           <chr>          <chr> <chr> <chr> <chr>          
1 Ben   Alpha      2011 director        NA             NA    xxxx  xx    director       
2 Ben   Beta       2011 director        vice president NA    xx    xxx   director       
3 Wendy Sigma      2011 director        vice president CEO   xxxx  x     director       
4 Alice Beta       2011 staff           NA             NA    xxx   x     staff          
5 Alice Beta       2012 deputy director NA             NA    xx    xx    deputy director

Upvotes: 1

Related Questions