My data looks like this: df1 #> Artist Album Year #> 1 Beatles Sgt. Pepper's 1967 #> 2 Rolling Stones Sticky Fingers 1971 and df2 #> Artist Members #> 1 Beatles George #> 2 Beatles Ringo #> 3 Beatles Paul #> 4 Beatles John And I would like to join these two dfs, in what I believe is an "untidy" fashion. Despite the untidyness, it will be very helpful for me to have the final output look like the example below, where every band (Artist) only takes up one row, and the band members are all put into one column, separated by say commas: Desired Output #> Artist Album Members Year #> 1 Beatles Sgt. Pepper's George, Ringo, Paul, John 1967 #> 2 Rolling Stones Sticky Fingers 1971 I've been able to get close to a solution (below), but: Is there a simpler way to do this? How can I generalise my code so that if there is a band with say 11 members, or 13 members, the code will still work? When data is missing, like for the Rolling Stones, the values are "NA". Is it easy to make them blank? library(tidyverse) df1 <- data.frame(stringsAsFactors=FALSE, Artist = c("Beatles", "Rolling Stones"), Album = c("Sgt. Pepper's", "Sticky Fingers"), Year = c(1967, 1971) ) df2 <- data.frame(stringsAsFactors=FALSE, Artist = c("Beatles", "Beatles", "Beatles", "Beatles"), Members = c("George", "Ringo", "Paul", "John") ) df <- left_join(df1, df2, by = "Artist") df <- df %>% group_by(Artist) %>% mutate(member_number = seq_along(Members)) df <- spread(df, key = "member_number", value = "Members", sep = "_") df <- df %>% unite(col = "members", member_number_1:member_number_4, sep = ",") Which gives the output df #> # A tibble: 2 x 4 #> # Groups: Artist [2] #> Artist Album Year members #> <chr> <chr> <dbl> <chr> #> 1 Beatles Sgt. Pepper's 1967 George,Ringo,Paul,John #> 2 Rolling Stones Sticky Fingers 1971 NA,NA,NA,NA

Reputation: 1792

join two dataframes so that a column contains multiple values

My data looks like this:

df1
#>           Artist          Album Year
#> 1        Beatles  Sgt. Pepper's 1967
#> 2 Rolling Stones Sticky Fingers 1971

and

df2
#>    Artist Members
#> 1 Beatles  George
#> 2 Beatles   Ringo
#> 3 Beatles    Paul
#> 4 Beatles    John

And I would like to join these two dfs, in what I believe is an "untidy" fashion. Despite the untidyness, it will be very helpful for me to have the final output look like the example below, where every band (Artist) only takes up one row, and the band members are all put into one column, separated by say commas:

Desired Output
#>           Artist          Album                   Members Year
#> 1        Beatles  Sgt. Pepper's George, Ringo, Paul, John 1967
#> 2 Rolling Stones Sticky Fingers                           1971

I've been able to get close to a solution (below), but:

Is there a simpler way to do this?
How can I generalise my code so that if there is a band with say 11 members, or 13 members, the code will still work?
When data is missing, like for the Rolling Stones, the values are "NA". Is it easy to make them blank?

library(tidyverse)
df1 <- data.frame(stringsAsFactors=FALSE,
      Artist = c("Beatles", "Rolling Stones"),
       Album = c("Sgt. Pepper's", "Sticky Fingers"),
        Year = c(1967, 1971)
)

df2 <- data.frame(stringsAsFactors=FALSE,
       Artist = c("Beatles", "Beatles", "Beatles", "Beatles"),
    Members = c("George", "Ringo", "Paul", "John")
)

df <- left_join(df1, df2, by = "Artist")
df <- df %>% group_by(Artist) %>% mutate(member_number = seq_along(Members))
df <- spread(df, key = "member_number", value = "Members", sep = "_")
df <- df %>% unite(col = "members", member_number_1:member_number_4, sep = ",")

Which gives the output

df
#> # A tibble: 2 x 4
#> # Groups:   Artist [2]
#>   Artist         Album           Year members               
#>   <chr>          <chr>          <dbl> <chr>                 
#> 1 Beatles        Sgt. Pepper's   1967 George,Ringo,Paul,John
#> 2 Rolling Stones Sticky Fingers  1971 NA,NA,NA,NA

Upvotes: 1

Answers (4)

moodymudskipper

Reputation: 47340

My package safejoin allows aggregation operations on the joined table by the joining variable :

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
df1 %>% eat(df2, .agg = toString)
# Joining, by = "Artist"
#           Artist          Album Year                   Members
# 1        Beatles  Sgt. Pepper's 1967 George, Ringo, Paul, John
# 2 Rolling Stones Sticky Fingers 1971                      <NA>

Upvotes: 0

akrun

Reputation: 887841

Using data.table

library(data.table)
setDT(df2)[df1, on = .(Artist)][, .(members = toString(Members)),
   .(Artist, Album, Year)]
#          Artist          Album Year                   members
#1:        Beatles  Sgt. Pepper's 1967 George, Ringo, Paul, John
#2: Rolling Stones Sticky Fingers 1971                        NA

Upvotes: 2

Stephen Henderson

Reputation: 6532

slightly differently:

library(dplyr)


 left_join(df1, df2) %>% 
    group_by(Artist, Album, Year) %>% 
    summarise(members = paste(Members, collapse = ","))

# A tibble: 2 x 4
# Groups:   Artist, Album [?]
  Artist         Album           Year members               
  <chr>          <chr>          <dbl> <chr>                 
1 Beatles        Sgt. Pepper's   1967 George,Ringo,Paul,John
2 Rolling Stones Sticky Fingers  1971 NA

Upvotes: 3

Ronak Shah

Reputation: 389235

We can left_join and then summarise multiple columns and collapse them into unique comma-separated strings.

library(dplyr)

left_join(df1, df2, by = "Artist") %>%
   group_by(Artist) %>%
   summarise_at(vars(Album:Members), ~toString(unique(.)))

# A tibble: 2 x 4
#  Artist         Album          Year  Members                  
#  <chr>          <chr>          <chr> <chr>                    
#1 Beatles        Sgt. Pepper's  1967  George, Ringo, Paul, John
#2 Rolling Stones Sticky Fingers 1971  NA

Upvotes: 2

join two dataframes so that a column contains multiple values

Answers (4)

Related Questions