
Reputation: 1008

Conditional sorting of a data.table

I have a very large data.table that I want to trim down as follows:

Each id should appear only once
If the same log contains any value other than "X", that value should be kept
If the log contains only "X", the first "X" should be kept
If the log contains more than one value other than "X", all of them should be kept, separated by commas, but not the "X".

Sample dataset:

library(data.table)
dt <- data.table(
    id=c(1,1,2,3,3,4,4,4,5,5),
    log=c(11,11,11,12,12,12,12,12,13,13),
    art=c("X", "Y", "X", "X", "X", "Z", "X", "Y","X", "X")
  )
dt
    id log art
 1:  1  11   X
 2:  1  11   Y
 3:  2  11   X
 4:  3  12   X
 5:  3  12   X
 6:  4  12   Z
 7:  4  12   X
 8:  4  12   Y
 9:  5  13   X
10:  5  13   X

Required output:

 id  log  art
  1   11  Y
  2   11  Y
  3   12  Z,Y
  4   12  Z,Y
  5   13  X

Upvotes: 1

Answers (3)

lmo

Reputation: 38500

Here is one method, though there may be a more efficient approach.

unique(dt[,.(id, log)])[dt[, .(art=if(.N == 1 | all(art == "X"))
                                      art[1] else toString(unique(art[art != "X"]))),
                           by=log], on="log"]

which returns

   id log  art
1:  1  11    Y
2:  2  11    Y
3:  3  12 Z, Y
4:  4  12 Z, Y
5:  5  13    X

This computes the desired value of art for each log and joins it onto the unique pairs of id and log. It assumes that no id spans two logs, which is the case in the example.
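For readability, the same idea can be split into named steps. This is only a sketch of the approach described above, not a different method: the intermediate names `art_by_log`, `id_log` and `result` are made up for illustration, and the `.N == 1` check is dropped on the assumption that the `all(art == "X")` test already covers the single-row case.

```r
library(data.table)

dt <- data.table(
  id  = c(1, 1, 2, 3, 3, 4, 4, 4, 5, 5),
  log = c(11, 11, 11, 12, 12, 12, 12, 12, 13, 13),
  art = c("X", "Y", "X", "X", "X", "Z", "X", "Y", "X", "X")
)

# Step 1: one art value per log. Keep a single "X" if the log holds nothing
# else; otherwise collapse the distinct non-"X" values into one string.
art_by_log <- dt[, .(art = if (all(art == "X")) "X"
                           else toString(unique(art[art != "X"]))),
                 by = log]

# Step 2: the unique (id, log) pairs. Assumes no id spans two logs.
id_log <- unique(dt[, .(id, log)])

# Step 3: for each (id, log) pair, look up that log's art value.
result <- art_by_log[id_log, on = "log"]
setcolorder(result, c("id", "log", "art"))
result
```

Note that `toString()` separates values with a comma and a space ("Z, Y"). If the output must match the question exactly ("Z,Y"), `paste(unique(art[art != "X"]), collapse = ",")` can be used in its place.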

Upvotes: 2

M--

Reputation: 28825

Just wanted to try this with dplyr:

library(data.table)
library(dplyr)
dat <- setDT(dt %>% group_by(id) %>%
               unique() %>%
               summarise(bldlog = mean(log),
                         art = gsub("X,|,X", "", paste(art, collapse = ","))))


dat

#    id bldlog art 
# 1:  1     11   Y 
# 2:  2     11   X 
# 3:  3     12   X 
# 4:  4     12 Z,Y 
# 5:  5     13   X

Upvotes: 0

akrun

Reputation: 886948

We can try

dt[,  .(art = if(all(art=="X")) "X" else 
     toString(unique(art[art != "X"]))), .(id, logbld = log)]
#    id logbld  art
#1:  1     11    Y
#2:  2     11    X
#3:  3     12    X
#4:  4     12 Z, Y
#5:  5     13    X

Upvotes: 2
