user16641281
user16641281

Reputation:

Merge rows when same word appears in another column

I want to merge rows together that have the same word in another column. The table entries are all comma separated strings (characters), not lists. So as shown below the color shades for the same color should be added in a string in one row, instead of spanning over several rows. Also there should be no duplicates in the color shades column.

I have already tried with:

   aggregate(df["Color shades"], df["Color"], paste, collapse=", ")

As well as with:

   aggregate(Color shades ~ Color ,df ,toString)

But that didn't lead to the desired result.

Dataframe:

    df <- data.frame(colorshades = c("turquoise, babyblue", "royal blue, true blue", 
                         "navy blue, true blue"), colors = c("blue", "blue", "blue"))

Currently:

Color shades Color
turquoise, babyblue blue
royal blue, true blue blue
navy blue, true blue blue

Desired Output:

Color shades Color
turquoise, babyblue, royal blue, true blue, navy blue blue

Upvotes: 1

Views: 59

Answers (4)

JR96
JR96

Reputation: 973

One can also use tidytext to with unnest

library(dplyr)
library(tidytext)

color_df <- tibble(color= rep("blue", times = 3),
                       color_shades = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"))

color_shades_agg <- color_df %>% 
  unnest_tokens(word, color_shades, token = 'regex', pattern=", ") %>% 
  group_by(color) %>%
  distinct() %>% 
  summarise(color_shades = paste0(sort(word), collapse = ", "))

Upvotes: 0

Wimpel
Wimpel

Reputation: 27732

data.table solution

library(data.table)
setDT(df)[, .(Color_shades = paste0(unique(unlist(strsplit(colorshades, ", "))), 
                                    collapse = ", ")), 
          by = .(colors)]
#    colors                                          Color_shades
# 1:   blue turquoise, babyblue, royal blue, true blue, navy blue

Upvotes: 0

DataM
DataM

Reputation: 351

if you can use the library "dplyr" you can also do it this way :

library(dplyr)

df <- data.frame("Colorshade" = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"),
             "Color" = c(rep("blue", 3)),
             stringsAsFactors = FALSE)

my_df <- df %>% group_by(Color) %>% mutate(Colorshade = paste(unique(sort(str_split(string = paste(df$Colorshade, collapse = ", "), pattern = ", ", simplify = TRUE))), collapse = ", ")) %>% first()

Upvotes: 0

r2evans
r2evans

Reputation: 160447

  1. Convert "Color shades" to a list-column:

    lapply(strsplit(df[["Color shades"]], ","), trimws)
    # [[1]]
    # [1] "turquoise" "babyblue" 
    # [[2]]
    # [1] "royal blue" "true blue" 
    # [[3]]
    # [1] "navy blue" "true blue"
    df[["Color shades"]] <- lapply(strsplit(df[["Color shades"]], ","), trimws)
    df
    #            Color shades Color
    # 1   turquoise, babyblue  blue
    # 2 royal blue, true blue  blue
    # 3  navy blue, true blue  blue
    
  2. Aggregate with unique:

    aggregate(df["Color shades"], df["Color"], function(z) paste(unique(unlist(z)), collapse=", "))
    #   Color                                          Color shades
    # 1  blue turquoise, babyblue, royal blue, true blue, navy blue
    

    or, keeping with the list-column method,

    aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z))))
    #   Color                                          Color shades
    # 1  blue turquoise, babyblue, royal blue, true blue, navy blue
    str(aggregate(df["Color shades"], df["Color"], function(z) list(unique(unlist(z)))))
    # 'data.frame': 1 obs. of  2 variables:
    #  $ Color       : chr "blue"
    #  $ Color shades:List of 1
    #   ..$ : chr  "turquoise" "babyblue" "royal blue" "true blue" ...
    

There are often (but not always) advantages to dealing with list-columns vice comma-separated values. If your use-case is such that you frequently want to look at individual elements within one of these fields, you'll find yourself dealing deeply with regexes and/or repeatedly using strsplit to separate. With list-columns, one can use tools like unique and %in% with abandon (though admittedly one should really become more comfortable with lapply/sapply, and many base-R tools for aggregation do not always work consistently with list-columns).


Data

df <- structure(list(`Color shades` = c("turquoise, babyblue", "royal blue, true blue", "navy blue, true blue"), Color = c("blue", "blue", "blue")), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 4

Related Questions