Kryo
Kryo

Reputation: 943

subsetting and reformatting dataframe with uniques characters

How can i reformat and subset a dataframe with following condition a) unique(df1$Name) as rownames, b)unique(df1$ID) ad colnames, c)values for each unique ID for each Name. Note: if a unique name has repeated IDs, select only values for those IDs with highest frequency

df1 <-

 Name   ID   value  Frequency
    AA  A-1     0   50
    AA  A-2     1   30
    AA  A-3     0   10
    AA  A-3     1   60
    AA  A-1     1   10
    AA  A-4     1   40
    AA  A-5     1   50
    BB  A-1     0   10
    BB  A-1     1   30
    BB  A-2     1   20
    BB  A-3     0   10
    BB  A-4     1   40
    BB  A-4     0   60
    BB  A-2     1   10
    CC  A-1     0   10
    CC  A-2     0   20
    CC  A-3     1   10
    CC  A-3     1   15
    CC  A-1     1   17

expectedoutput <-

Name    A-1 A-2 A-3 A-4 A-5
    AA  0   1   1   1   1
    BB  1   1   0   0   
    CC  1   0   1       

Upvotes: 1

Views: 52

Answers (1)

akrun
akrun

Reputation: 887851

We could use dcast to convert from 'long' to 'wide' after filtering out the rows with the highest 'Frequency' for the groups 'Name', 'ID'. setDT converts the data.frame to data.table, which.max gets the numeric index of row with maximum 'Frequency' for the group (by=list(Name, ID)). We subset the dataset based on the index (.SD- Subset of Data.table), then use dcast.

library(data.table)#v.1.9.5+
dcast(setDT(df1)[, .SD[which.max(Frequency)], by = list(Name, ID)], 
                            Name~ID, value.var='value', fill=0)
#    Name A-1 A-2 A-3 A-4 A-5
#1:   AA   0   1   1   1   1
#2:   BB   1   1   0   0   0
#3:   CC   1   0   1   0   0

Or we can use a similar approach in dplyr/tidyr by filtering out the maximum Frequency rows with slice after grouping, and then use spread to change the format from 'long' to 'wide'.

library(dplyr)
library(tidyr)
 df1 %>% 
    group_by(Name, ID) %>%
    slice(which.max(Frequency)) %>% 
    select(-Frequency) %>% 
    spread(ID, value, fill=0)

NOTE: Both spread, and dcast have fill option. By default it is NA.

data

df1 <- structure(list(Name = c("AA", "AA", "AA", "AA", "AA", "AA", 
"AA", 
"BB", "BB", "BB", "BB", "BB", "BB", "BB", "CC", "CC", "CC", "CC", 
"CC"), ID = c("A-1", "A-2", "A-3", "A-3", "A-1", "A-4", "A-5", 
"A-1", "A-1", "A-2", "A-3", "A-4", "A-4", "A-2", "A-1", "A-2", 
"A-3", "A-3", "A-1"), value = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 
1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L), Frequency = c(50L, 
30L, 10L, 60L, 10L, 40L, 50L, 10L, 30L, 20L, 10L, 40L, 60L, 10L, 
10L, 20L, 10L, 15L, 17L)), .Names = c("Name", "ID", "value", 
"Frequency"), class = "data.frame", row.names = c(NA, -19L))

Upvotes: 1

Related Questions