PSraj
PSraj

Reputation: 229

Repeated rows In R with different sequence

I managed to get my raw data in this form(in R) i.e for each combination of products(combination of 3) their respective count, but as you can see there is duplication happening, there are rows with same product but different ordering of those products, I needed a way to combine those rows irrespective of the order and add the sum(num) of those to get the combined sum. This is just a part of the entire data set. Help me to figure out a way to do this.

             pages               sum(num)
Badezimmer,Baumarkt,Büromöbel          6
Badezimmer,Baumarkt,Dekoration        14
Badezimmer,Baumarkt,Flur              30
Badezimmer,Baumarkt,Garten            18
Badezimmer,Baumarkt,Heimtextilien    100
Badezimmer,Baumarkt,Kinder            28
Badezimmer,Büromöbel,Baumarkt         16
Badezimmer,Flur,Baumarkt              40

Upvotes: 1

Views: 106

Answers (2)

RHertel
RHertel

Reputation: 23818

Here is one possibility:

df1$pages <- as.character(df1$pages) # prevent use of factors    
df1$pages <- sapply(sapply(df1$pages,function(x) strsplit(x,",")),function(x) paste(sort(unlist(x)),collapse=',')) #split at commas, order words alphabetically, and restore the description
df1 <- aggregate(sum.num. ~ ., df1, sum) #sum over identical 'pages'
#                              pages sum.num.
#1     Badezimmer,Baumarkt,Büromöbel       22
#2    Badezimmer,Baumarkt,Dekoration       14
#3          Badezimmer,Baumarkt,Flur       70
#4        Badezimmer,Baumarkt,Garten       18
#5 Badezimmer,Baumarkt,Heimtextilien      100
#6        Badezimmer,Baumarkt,Kinder       28

data:

df1 <- structure(list(pages = structure(1:8, 
.Label = c("Badezimmer,Baumarkt,Büromöbel", 
"Badezimmer,Baumarkt,Dekoration", "Badezimmer,Baumarkt,Flur",
"Badezimmer,Baumarkt,Garten", "Badezimmer,Baumarkt,Heimtextilien",
"Badezimmer,Baumarkt,Kinder", "Badezimmer,Büromöbel,Baumarkt",
"Badezimmer,Flur,Baumarkt"), class = "factor"), 
sum.num. = c(6L, 14L, 30L, 18L, 100L, 28L, 16L, 40L)), 
.Names = c("pages", "sum.num."), class = "data.frame", 
row.names = c(NA, -8L))

Upvotes: 3

akrun
akrun

Reputation: 887991

Here is an option using cSplit from library(splitstackshape). We convert the 'data.frame' to 'data.table', create a row id column 'rn' with the option keep.rownames=TRUE, split the 'pages' column by ',' and convert to 'long' format with cSplit. Grouped by 'rn', we sort the 'pages' and paste it together, also we get the first value of 'sum.num.'. Then, we get the sum of 'sum.num.' grouped by 'pages'.

library(splitstackshape)
library(data.table)
cSplit(setDT(df1, keep.rownames=TRUE), 'pages', ',', 'long')[,
   list(pages=toString(sort(pages)), sum.num.= sum.num.[1]) ,rn
   ][,list(Sum=sum(sum.num.)) , .(pages)]
 #                                 pages Sum
#1:     Badezimmer, Baumarkt, Büromöbel  22
#2:    Badezimmer, Baumarkt, Dekoration  14
#3:          Badezimmer, Baumarkt, Flur  70
#4:        Badezimmer, Baumarkt, Garten  18
#5: Badezimmer, Baumarkt, Heimtextilien 100
#6:        Badezimmer, Baumarkt, Kinder  28

NOTE: 'df1' from @RHertel's post.

Upvotes: 4

Related Questions