anaisf66

Reputation: 39

Select the rows with most available information in a data frame with R

I would like to select the rows with the most information in a data frame. The data frame is generated automatically, so the number of columns increases over time.

The data look like this:

Player  V1  F1  V2  F2  V3  F3  V4  F4
111111  0   0   1   3   0   0   1   3
111111  0   0   1   3   1   3   1   3
222222  3   4   0   0   3   4   3   4
222222  3   4   3   4   3   4   3   4
33333   1   2   1   2   1   2   1   2
33333   1   2   1   2   1   2   0   0

and the result should be:

Player  V1  F1  V2  F2  V3  F3  V4  F4
111111  0   0   1   3   1   3   1   3
222222  3   4   3   4   3   4   3   4
33333   1   2   1   2   1   2   1   2

The idea is to select, for each player, the row with the most complete information. I consider a 0 to be incomplete information.
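For reference, the sample data can be rebuilt as a data frame and the amount of information in each row counted as the number of non-zero entries. This is only a minimal sketch: the names `df`, `value_cols` and `n_info` are assumptions made for illustration and are not part of the original question.

```r
# Rebuild the sample data from the question (df is a hypothetical name)
df <- read.table(header = TRUE, text = "
Player  V1  F1  V2  F2  V3  F3  V4  F4
111111  0   0   1   3   0   0   1   3
111111  0   0   1   3   1   3   1   3
222222  3   4   0   0   3   4   3   4
222222  3   4   3   4   3   4   3   4
33333   1   2   1   2   1   2   1   2
33333   1   2   1   2   1   2   0   0
")

# Every column except Player holds values, however many columns there are
value_cols <- setdiff(names(df), "Player")

# Number of non-zero entries per row (0 is treated as incomplete information)
df$n_info <- rowSums(df[, value_cols] != 0)
print(df)
```

The row to keep for each player is then the one with the largest `n_info`.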

Upvotes: 0

Views: 407

Answers (2)

user5249203

Reputation: 4648

You mentioned that the data frame is generated automatically, so the number of columns increases over time. Are you trying to do the grouping in real time?

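The data.table approach below groups by the Player column and takes the maximum value of each column within each group. It works for the representative example you gave. It is similar to the answer @arun gave to "Group by one column, select row with minimum in one column for every pair of columns in R". Note that which.max is applied to each column independently, so the result holds per-column maxima and may combine values from different rows of the same player.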

library(data.table)
dt <- as.data.table(df)
# Value columns: everything except Player, so newly added columns are picked up
my_cols <- setdiff(names(dt), "Player")

# General form of a data.table expression: DT[i, j, by].
# A missing i means "on all rows"; j is evaluated once per group in 'by'.
# mget() returns the columns named in my_cols as a list, which.max() gives
# the position of each column's maximum, and Map() applies `[` to extract it.
dt[, Map(`[`, mget(my_cols), lapply(mget(my_cols), which.max)), by = Player]
#    Player V1 F1 V2 F2 V3 F3 V4 F4
# 1: 111111  0  0  1  3  1  3  1  3
# 2: 222222  3  4  3  4  3  4  3  4
# 3:  33333  1  2  1  2  1  2  1  2
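If a whole row should be kept for each player, rather than per-column maxima, one possible variant is to count the non-zero entries per row and keep the row with the highest count in each group. This is a sketch under assumptions, not the method of the answer above: it assumes the data.table package is installed, and the names `value_cols`, `n_info` and `best` are hypothetical helpers.

```r
library(data.table)

# Sample data from the question
dt <- as.data.table(read.table(header = TRUE, text = "
Player  V1  F1  V2  F2  V3  F3  V4  F4
111111  0   0   1   3   0   0   1   3
111111  0   0   1   3   1   3   1   3
222222  3   4   0   0   3   4   3   4
222222  3   4   3   4   3   4   3   4
33333   1   2   1   2   1   2   1   2
33333   1   2   1   2   1   2   0   0
"))

# Every column except Player, so newly added columns are included
value_cols <- setdiff(names(dt), "Player")

# Count the non-zero entries in each row
dt[, n_info := rowSums(.SD != 0), .SDcols = value_cols]

# Per player, keep the first row with the highest count
best <- dt[, .SD[which.max(n_info)], by = Player]

# Drop the helper column again
best[, n_info := NULL]
print(best)
```

Because `which.max()` returns the first maximum, ties between rows of the same player are resolved in favour of the row that appears first.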

Upvotes: 2

geekoverdose

Reputation: 1007

As already pointed out by @Imo and @evan058, it's not clear what "most complete information" means. I assume you consider a 0 to be missing information, so that "most complete" refers to the row with the fewest 0 entries for each player.

This snippet should do the job then:

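library(plyr)
# For each player, keep the row with the most non-zero entries
newData <- ldply(unique(data$Player), function(player) {
  tmp <- data[data$Player == player, ]
  # tmp[, -1] drops the Player column (assumed to be the first column);
  # which.max() returns the first row in case of ties
  tmp[which.max(rowSums(tmp[, -1] != 0)), ]
})
print(newData)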
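The same idea can also be written in base R without plyr, which may be convenient if adding a dependency is not desired. The sketch below assumes, as above, that the data frame is called `data` and that Player is its first column; `pieces` and `best` are hypothetical helper names.

```r
# Sample data from the question
data <- read.table(header = TRUE, text = "
Player  V1  F1  V2  F2  V3  F3  V4  F4
111111  0   0   1   3   0   0   1   3
111111  0   0   1   3   1   3   1   3
222222  3   4   0   0   3   4   3   4
222222  3   4   3   4   3   4   3   4
33333   1   2   1   2   1   2   1   2
33333   1   2   1   2   1   2   0   0
")

# One data frame per player
pieces <- split(data, data$Player)

# In each piece, keep the row with the most non-zero entries
best <- lapply(pieces, function(tmp) {
  tmp[which.max(rowSums(tmp[, -1] != 0)), ]
})

# Stack the selected rows back into one data frame
newData <- do.call(rbind, best)
rownames(newData) <- NULL
print(newData)
```

Note that `split()` orders the groups by sorted Player value, so the rows of `newData` may come out in a different order than in the input.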

Upvotes: 0
