Reputation: 943
How can i reformat and subset a dataframe with following condition a) unique(df1$Name) as rownames, b)unique(df1$ID) ad colnames, c)values for each unique ID for each Name. Note: if a unique name has repeated IDs, select only values for those IDs with highest frequency
df1 <-
Name ID value Frequency
AA A-1 0 50
AA A-2 1 30
AA A-3 0 10
AA A-3 1 60
AA A-1 1 10
AA A-4 1 40
AA A-5 1 50
BB A-1 0 10
BB A-1 1 30
BB A-2 1 20
BB A-3 0 10
BB A-4 1 40
BB A-4 0 60
BB A-2 1 10
CC A-1 0 10
CC A-2 0 20
CC A-3 1 10
CC A-3 1 15
CC A-1 1 17
expectedoutput <-
Name A-1 A-2 A-3 A-4 A-5
AA 0 1 1 1 1
BB 1 1 0 0
CC 1 0 1
Upvotes: 1
Views: 52
Reputation: 887851
We could use dcast
to convert from 'long' to 'wide' after filtering out the rows with the highest 'Frequency' for the groups 'Name', 'ID'. setDT
converts the data.frame
to data.table
, which.max
gets the numeric index of row with maximum 'Frequency' for the group (by=list(Name, ID)
). We subset the dataset based on the index (.SD
- Subset of Data.table
), then use dcast
.
library(data.table)#v.1.9.5+
dcast(setDT(df1)[, .SD[which.max(Frequency)], by = list(Name, ID)],
Name~ID, value.var='value', fill=0)
# Name A-1 A-2 A-3 A-4 A-5
#1: AA 0 1 1 1 1
#2: BB 1 1 0 0 0
#3: CC 1 0 1 0 0
Or we can use a similar approach in dplyr/tidyr
by filtering out the maximum Frequency rows with slice
after grouping, and then use spread
to change the format from 'long' to 'wide'.
library(dplyr)
library(tidyr)
df1 %>%
group_by(Name, ID) %>%
slice(which.max(Frequency)) %>%
select(-Frequency) %>%
spread(ID, value, fill=0)
NOTE: Both spread
, and dcast
have fill
option. By default it is NA
.
df1 <- structure(list(Name = c("AA", "AA", "AA", "AA", "AA", "AA",
"AA",
"BB", "BB", "BB", "BB", "BB", "BB", "BB", "CC", "CC", "CC", "CC",
"CC"), ID = c("A-1", "A-2", "A-3", "A-3", "A-1", "A-4", "A-5",
"A-1", "A-1", "A-2", "A-3", "A-4", "A-4", "A-2", "A-1", "A-2",
"A-3", "A-3", "A-1"), value = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L,
1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L), Frequency = c(50L,
30L, 10L, 60L, 10L, 40L, 50L, 10L, 30L, 20L, 10L, 40L, 60L, 10L,
10L, 20L, 10L, 15L, 17L)), .Names = c("Name", "ID", "value",
"Frequency"), class = "data.frame", row.names = c(NA, -19L))
Upvotes: 1