Reputation: 60004
How do I summarize a data.table with unreliable data across multiple columns?
Specifically, given
fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
behavior=c(rep(FALSE,5),rep(TRUE,5)),
country=c(rep(1,4),rep(2,6)),
language=c(rep(6,6),rep(5,4)),
event=1:10, key=c("user",fields))
dt
# user behavior country language event
# 1: 3 FALSE 1 6 1
# 2: 3 FALSE 1 6 2
# 3: 3 FALSE 1 6 3
# 4: 3 FALSE 1 6 4
# 5: 3 FALSE 2 6 5
# 6: 4 TRUE 2 5 7
# 7: 4 TRUE 2 5 8
# 8: 4 TRUE 2 5 9
# 9: 4 TRUE 2 5 10
# 10: 4 TRUE 2 6 6
I want to get
# user behavior country.name country.support language.name language.support
# 1: 3 FALSE 1 0.8 6 1.0
# 2: 4 TRUE 2 1.0 5 0.8
(here the x.name
is the most common x for the user
and x.support
is the share events where this top x was observed)
without having to go through both fields
by hand like this:
users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
# user behavior country.name country.support language.name language.support
# 1: 3 FALSE 1 0.8 6 1.0
# 2: 4 TRUE 2 1.0 5 0.8
The actual number of fields
is 5 and I want to avoid having to repeat the same code for each field separately, and have to edit this function if I ever modify fields
.
Please note that this is the substance of this question, the support computation was kindly explained to me elsewhere.
As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying like in users <- users[dt.out]
.
Upvotes: 3
Views: 4001
Reputation: 4094
Does this solve your problem?
fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
behavior=c(rep(FALSE,5),rep(TRUE,5)),
country=c(rep(1,4),rep(2,6)),
language=c(rep(6,6),rep(5,4)),
event=1:10, key=c("user",fields))
CalculateSupport <- function(dt, name) {
x <- dt[, .N, by = eval(paste0('user,', name))]
setnames(x, name, 'name')
x <- x[, list(name[which.max(N)], max(N)/sum(N)), by = user]
setnames(x, c('V1', 'V2'), paste0(name, c(".name", ".support")))
x
}
users <- dt[, sum(behavior) > 0, by=user]
setnames(users, "V1", "behavior")
Reduce(function(x, name) x[CalculateSupport(dt, name)], fields, users)
results in
user behavior country.name country.support language.name language.support
1: 3 FALSE 1 0.8 6 1.0
2: 4 TRUE 2 1.0 5 0.8
P.S. Please please take Ricardo's comment to your question seriously. SO is full of wonderful people who are willing to help but you have to treat them nicely and with respect.
Upvotes: 5
Reputation: 537
I can't do it in one expression, since I am not sure how to reuse a created field in a data.table expression. It's also probably not the most efficient way. Maybe this will make a good starting point, though.
#Find most common country and language for each user
summ.dt<-dt[,list(behavior.summ=sum(behavior)>0,
country.name=dt[user==.BY[[1]],.N,by=country][N==max(N),country],
language.name=dt[user==.BY[[1]],.N,by=language][N==max(N),language]),
by=user]
#Get support for each country and language for each user
summ.dt[,c("country.support","language.support"):=list(
nrow(dt[user==.BY[[1]] & country==country.name])/nrow(dt[user==.BY[[1]]]),
nrow(dt[user==.BY[[1]] & language==language.name])/nrow(dt[user==.BY[[1]]])
),by=user]
user behavior.summ country.name language.name country.support language.support
1: 3 FALSE 1 6 0.8 1.0
2: 4 TRUE 2 5 1.0 0.8
Upvotes: 1