Reputation: 1008
I have very large data.table
that I want to trim down in this fashion:
Only one unique id
If there is any other data than "X" in the same log, that other should stay
If only X, then the first X should stay
If there is more than one other than "X", then all those should stay, separated by commas, but not the "X".
Sample dataset:
library(data.table)
dt <- data.table(
id=c(1,1,2,3,3,4,4,4,5,5),
log=c(11,11,11,12,12,12,12,12,13,13),
art=c("X", "Y", "X", "X", "X", "Z", "X", "Y","X", "X")
)
dt
id log art
1: 1 11 X
2: 1 11 Y
3: 2 11 X
4: 3 12 X
5: 3 12 X
6: 4 12 Z
7: 4 12 X
8: 4 12 Y
9: 5 13 X
10: 5 13 X
Required output:
id log art
1 11 Y
2 11 Y
3 12 Z,Y
4 12 Z,Y
5 13 X
Upvotes: 1
Views: 366
Reputation: 38500
Here is one method, though there maybe a more efficient approach.
unique(dt[,.(id, log)])[dt[, .(art=if(.N == 1 | all(art == "X"))
art[1] else toString(unique(art[art != "X"]))),
by=log], on="log"]
which returns
id log art
1: 1 11 Y
2: 2 11 Y
3: 3 12 Z, Y
4: 4 12 Z, Y
5: 5 13 X
perform a left join of the desired values of art by each log onto the unique pairs of ID and log. This assumes that no ID spans two logs, which is the case in the example.
Upvotes: 2
Reputation: 28825
Just wanted to try this with dplyr
:
library(data.table)
library(dplyr)
dat <- setDT(dt %>% group_by(id) %>%
unique() %>%
summarise(bldlog = mean(log),
art = gsub("X,|,X", "",paste(art, collapse = ","))))
dat
# id bldlog art
# 1: 1 11 Y
# 2: 2 11 X
# 3: 3 12 X
# 4: 4 12 Z,Y
# 5: 5 13 X
Upvotes: 0
Reputation: 886948
We can try
dt[, .(art = if(all(art=="X")) "X" else
toString(unique(art[art != "X"]))), .(id, logbld = log)]
# id logbld art
#1: 1 11 Y
#2: 2 11 X
#3: 3 12 X
#4: 4 12 Z, Y
#5: 5 13 X
Upvotes: 2