Reputation: 115

Collapse redundant rows in data table

I have a data table in the format:

myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)
print(myTable)

   Col1 Col2
1:    A    1
2:    A    2
3:    A    3
4:    B    4
5:    B    5
6:    B    6

I want show only the highest result for each category in Col1, then collapse all others and present their sum in Col2. It should look like this:

print(myTable)

       Col1 Col2
1:        A    3
2:   Others    3
3:        B    6
4:   Others    9

I managed to do it with the following code:

unique <- unique(myTable$Col1)                                  # unique values in Col1
myTable2 <- data.table()                                        # empty data table to populate
for(each in unique){
    temp <- myTable[Col1 == each, ]                             # filter myTable for unique Col1 values
    temp <- temp[order(-Col2)]                                  # order filtered table increasingly
    sumCol2 <- sum(temp$Col2)                                   # sum of values in filtered Col2
    temp <- temp[1, ] # retain only first element
    remSum <- sumCol2 - sum(temp$Col2)                          # remaining sum in Col2 (without first element)
    temp <- rbindlist(list(temp, data.table("Others", remSum))) # rbind first element and remaining elements
    myTable2 <- rbindlist(list(myTable2, temp))                 # populate data table from beginning
}

This works, but I am trying to shorten a very large data table, so it takes forever.

Is there any better way to approach this?

Thanks.

UPDATE: Actually my procedure is a little bit more complicated. I figured I would be able to develop it myself after the basics were mastered but it seems I will need further help instead. I want to display the 5 highest values in Col1, and collapse the others, but some entries in Col1 do not have 5 values; in these case, all entries should be displayed, and no "Others" row should be added.

Upvotes: 1

Answers (5)

moodymudskipper

Reputation: 47350

Here's a base R solution and the dplyr equivalent:

res <- aggregate(Col2 ~.,transform(
  myTable, Col0 = replace(Col1,duplicated(Col1,fromLast = TRUE), "Other")), sum)
res[order(res$Col1),-1]
#    Col0 Col2
# 1     A    3
# 3 Other    3
# 2     B    6
# 4 Other    9

myTable %>%
  group_by(Col0= Col1, Col1= replace(Col1,duplicated(Col1,fromLast = TRUE),"Other")) %>%
  summarize_at("Col2",sum) %>%
  ungroup %>%
  select(-1)
# # A tibble: 4 x 2
#   Col1   Col2
#   <chr> <int>
# 1 A         3
# 2 Other     3
# 3 B         6
# 4 Other     9

Upvotes: 0

Denny Ceccon

Reputation: 115

I did it! I made a new myTable to illustrate. I want to retain only the 4 highest values by category, and collapse the others.

set.seeed(123)
myTable <- data.table(Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)), Col2 = sample(1:12, 12))
print(myTable)

    Col1 Col2
 1:    A    8
 2:    A    5
 3:    A    2
 4:    B    7
 5:    B   10
 6:    B    9
 7:    B   12
 8:    B   11
 9:    C    4
10:    C    6
11:    C    3
12:    C    1

# set key to Col2, it will sort it increasingly
setkey(myTable, Col2)

# if there are more than 4 entries by Col1 category, will return all information, otherwise will return 4 entries completing with NA
myTable <- myTable[,.(Col2 = Col2[1:max(c(4, .N))]) , by = Col1]

# will print in Col1: 4 entries of Col1 category, then "Other"
# will print in Col2: 4 last entries of Col2 in that category, then the remaining sum 
myTable <- myTable[, .(Col1 = c(rep(Col1, 4), "Other"), Col2 = c(Col2[.N-3:0], sum(Col2) - sum(Col2[.N-3:0]))), by = Col1]

# removes rows with NA inserted in first step
myTable <- na.omit(myTable)

# removes rows where Col2 = 0, inserted because that Col1 category had exactly 4    entries
myTable <- myTable[Col2 != 0]

Owooooo!

Upvotes: 0

Andre Elrico

Reputation: 11500

do.call(
    rbind, lapply(split(myTable, factor(myTable$Col1)), function(x) rbind(x[which.max(x$Col2),], list("Other", sum(x$Col2[-which.max(x$Col2)]))))
)

#    Col1 Col2
#1:     A    3
#2: Other    3
#3:     B    6
#4: Other    9

Upvotes: 0

IceCreamToucan

Reputation: 28705

Here the data is split into groups according to the value of Col1 (by = Col1). .N is the index of the last row in the given group, so c(Col2[.N], sum(Col2) - Col2[.N])) gives the last value of Col2, and the sum of Col2 minus the last value. The newly created variables are surrounded by .() because .() is an alias for the list() function when using data.table, and the created columns need to go in a list.

library(data.table)
setDT(df)

df[, .(Col1 = c(Col1, 'Others'),
       Col2 = c(Col2[.N], sum(Col2) - Col2[.N]))
  , by = Col1][, -1]
#      Col1 Col2
# 1:      A    3
# 2: Others    3
# 3:      B    6
# 4: Others    9

Upvotes: 2

Nicolas2

Reputation: 2210

If it just a matter of displaying things you could the 'tables' packages :

others <- function(x) sum(x)-last(x)
df %>% tabular(Col1*(last+others) ~ Col2, .)

# Col1        Col2
# A    last   3   
#      others 3   
# B    last   6   
#      others 9

Upvotes: 1

Collapse redundant rows in data table

Answers (5)

Related Questions