Reputation: 229
I managed to get my raw data in this form(in R) i.e for each combination of products(combination of 3) their respective count, but as you can see there is duplication happening, there are rows with same product but different ordering of those products, I needed a way to combine those rows irrespective of the order and add the sum(num) of those to get the combined sum. This is just a part of the entire data set. Help me to figure out a way to do this.
pages sum(num)
Badezimmer,Baumarkt,Büromöbel 6
Badezimmer,Baumarkt,Dekoration 14
Badezimmer,Baumarkt,Flur 30
Badezimmer,Baumarkt,Garten 18
Badezimmer,Baumarkt,Heimtextilien 100
Badezimmer,Baumarkt,Kinder 28
Badezimmer,Büromöbel,Baumarkt 16
Badezimmer,Flur,Baumarkt 40
Upvotes: 1
Views: 106
Reputation: 23818
Here is one possibility:
df1$pages <- as.character(df1$pages) # prevent use of factors
df1$pages <- sapply(sapply(df1$pages,function(x) strsplit(x,",")),function(x) paste(sort(unlist(x)),collapse=',')) #split at commas, order words alphabetically, and restore the description
df1 <- aggregate(sum.num. ~ ., df1, sum) #sum over identical 'pages'
# pages sum.num.
#1 Badezimmer,Baumarkt,Büromöbel 22
#2 Badezimmer,Baumarkt,Dekoration 14
#3 Badezimmer,Baumarkt,Flur 70
#4 Badezimmer,Baumarkt,Garten 18
#5 Badezimmer,Baumarkt,Heimtextilien 100
#6 Badezimmer,Baumarkt,Kinder 28
data:
df1 <- structure(list(pages = structure(1:8,
.Label = c("Badezimmer,Baumarkt,Büromöbel",
"Badezimmer,Baumarkt,Dekoration", "Badezimmer,Baumarkt,Flur",
"Badezimmer,Baumarkt,Garten", "Badezimmer,Baumarkt,Heimtextilien",
"Badezimmer,Baumarkt,Kinder", "Badezimmer,Büromöbel,Baumarkt",
"Badezimmer,Flur,Baumarkt"), class = "factor"),
sum.num. = c(6L, 14L, 30L, 18L, 100L, 28L, 16L, 40L)),
.Names = c("pages", "sum.num."), class = "data.frame",
row.names = c(NA, -8L))
Upvotes: 3
Reputation: 887991
Here is an option using cSplit
from library(splitstackshape)
. We convert the 'data.frame' to 'data.table', create a row id column 'rn' with the option keep.rownames=TRUE
, split
the 'pages' column by ',' and convert to 'long' format with cSplit
. Grouped by 'rn', we sort
the 'pages' and paste
it together, also we get the first value of 'sum.num.'. Then, we get the sum
of 'sum.num.' grouped by 'pages'.
library(splitstackshape)
library(data.table)
cSplit(setDT(df1, keep.rownames=TRUE), 'pages', ',', 'long')[,
list(pages=toString(sort(pages)), sum.num.= sum.num.[1]) ,rn
][,list(Sum=sum(sum.num.)) , .(pages)]
# pages Sum
#1: Badezimmer, Baumarkt, Büromöbel 22
#2: Badezimmer, Baumarkt, Dekoration 14
#3: Badezimmer, Baumarkt, Flur 70
#4: Badezimmer, Baumarkt, Garten 18
#5: Badezimmer, Baumarkt, Heimtextilien 100
#6: Badezimmer, Baumarkt, Kinder 28
NOTE: 'df1' from @RHertel's post.
Upvotes: 4