Reputation: 39
I would like to select, for each player, the row that contains the most information in a data frame. The data frame is generated automatically, so the number of columns grows over time.
The data look like this:
Player V1 F1 V2 F2 V3 F3 V4 F4
111111 0 0 1 3 0 0 1 3
111111 0 0 1 3 1 3 1 3
222222 3 4 0 0 3 4 3 4
222222 3 4 3 4 3 4 3 4
33333 1 2 1 2 1 2 1 2
33333 1 2 1 2 1 2 0 0
and the result should be:
Player V1 F1 V2 F2 V3 F3 V4 F4
111111 0 0 1 3 1 3 1 3
222222 3 4 3 4 3 4 3 4
33333 1 2 1 2 1 2 1 2
The idea is to select the rows with the most complete information. I consider a 0 to be incomplete information.
Upvotes: 0
Views: 407
Reputation: 4648
You mentioned that the data frame is generated automatically, so the number of columns grows over time. Are you trying to do the grouping in real time?
The data.table approach below groups by the Player column and takes the maximum of each column within each group. It works for the representative example you gave. Note that each column's maximum is taken independently, so in general the result for a player can combine values from different rows. This is similar to the answer provided by @arun in "Group by one column, select row with minimum in one column for every pair of columns in R".
library(data.table)
dt <- as.data.table(df)
# Get the column names
my_cols <- c("V1","F1","V2","F2","V3","F3","V4","F4")
# Map() applies `[` to each column named in my_cols, using the index that
# which.max() returns for that column; mget() fetches the columns by name.
# General data.table form: DT[i, j, by].
# A missing i means "on all rows".
# The expression in j is evaluated once per group of 'Player'.
dt[, Map(`[`, mget(my_cols), lapply(mget(my_cols), which.max)), by = Player]
# Player V1 F1 V2 F2 V3 F3 V4 F4
# 1: 111111 0 0 1 3 1 3 1 3
# 2: 222222 3 4 3 4 3 4 3 4
# 3: 33333 1 2 1 2 1 2 1 2
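Since the question says the set of columns keeps growing, hard-coding my_cols may become a maintenance burden. A minimal sketch, assuming every column other than Player is numeric and contains no NA: data.table's .SD holds all non-grouping columns, so the per-column maximum can be taken without naming them. The names df and res are only illustrative, and the data are rebuilt here from the question's example.

```r
library(data.table)

# Example data copied from the question.
df <- read.table(header = TRUE, text = "
Player V1 F1 V2 F2 V3 F3 V4 F4
111111 0 0 1 3 0 0 1 3
111111 0 0 1 3 1 3 1 3
222222 3 4 0 0 3 4 3 4
222222 3 4 3 4 3 4 3 4
33333 1 2 1 2 1 2 1 2
33333 1 2 1 2 1 2 0 0
")

dt <- as.data.table(df)

# .SD is the subset of data for each group, holding every column except
# the grouping column 'Player', so new columns are picked up automatically.
res <- dt[, lapply(.SD, max), by = Player]
print(res)
```

For numeric columns without NA, max(x) returns the same value as x[which.max(x)], so this should match the Map/which.max expression above on the example data.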
Upvotes: 2
Reputation: 1007
As already pointed out by @Imo and @evan058, it's not clear what "most complete information" means. I assume you consider a 0 to be missing information, so that "most complete" refers to the row with the fewest 0 entries per player. The following snippet should then do the job:
library(plyr)
newData <- ldply(unique(data$Player), function(player) {
  tmp <- data[data$Player == player, ]
  tmp[which.max(rowSums(tmp[, -1] != 0)), ]
})
print(newData)
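If you would rather not depend on plyr, the same idea can be sketched in base R. This assumes, as above, that the data frame is called data, that Player is the only non-value column, and that ties are broken by taking the first row with the maximum count; the helper names filled and keep are only illustrative.

```r
# Example data copied from the question.
data <- read.table(header = TRUE, text = "
Player V1 F1 V2 F2 V3 F3 V4 F4
111111 0 0 1 3 0 0 1 3
111111 0 0 1 3 1 3 1 3
222222 3 4 0 0 3 4 3 4
222222 3 4 3 4 3 4 3 4
33333 1 2 1 2 1 2 1 2
33333 1 2 1 2 1 2 0 0
")

# Number of non-zero entries in each row, ignoring the Player column.
filled <- rowSums(data[, names(data) != "Player"] != 0)

# For each player, find the index of the row with the most non-zero entries.
keep <- vapply(
  split(seq_len(nrow(data)), data$Player),
  function(i) i[which.max(filled[i])],
  integer(1)
)

# Keep those rows in their original order and reset the row names.
newData <- data[sort(unname(keep)), ]
rownames(newData) <- NULL
print(newData)
```

Unlike a per-column maximum, this always returns a row that actually exists in the input.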
Upvotes: 0