Reputation: 391
I have data that looks like this:
library(dplyr)
d<-data.frame(ID=c(1,1,2,3,3,4), Quality=c("Good", "Bad", "Ugly", "Good", "Good", "Ugly"), Area=c("East", "North", "North", "South", "East", "North"))
What I'd like to do is create one new column for each unique value in Quality and populate it with whether the ID matches that value and then aggregate the ID's. I want to do the same for Area.
This is what I have for when Quality == Good:
d$Quality.Good <- 0
d$Quality.Good[d$Quality=="Good"] <- 1
e <- d %>%
group_by(ID) %>%
summarise(n=n(), MAX.Quality.Good = max(Quality.Good))
e
Output
A tibble: 4 x 3
ID MAX.Quality.Good
<dbl> <dbl>
1 1 1
2 2 0
3 3 1
4 4 0
Is it possible to build a function that will loop through each character column and build an indicator column for Good, Bad, Ugly, North, East, South instead of copy pasting the above many more times?
Here's where I'm stuck:
library(stringr)
#vector of each Quality
e <-d %>%
group_by(Quality) %>%
summarise(n=n()) %>%
select(Quality)
e<-as.data.frame(e)
#create new column names
f <- str_c(names(e),".",e[,1])
#initialize list of new columns
d[f] <- 0
#I'm stuck after this...
Thank you!
Upvotes: 1
Views: 316
Reputation: 269654
1) Base R Create the model matrix for each column (using function make_mm
) and bind them together as a data frame m
. Finally aggregate on ID
. No packages are used.
make_mm <- function(nm, data) model.matrix(~ . - 1, data[nm])
m <- do.call("data.frame", lapply(names(d)[-1], make_mm, d))
with(d, aggregate(. ~ ID, m, max))
giving:
ID QualityBad QualityGood QualityUgly AreaEast AreaNorth AreaSouth
1 1 1 1 0 1 1 0
2 2 0 0 1 0 1 0
3 3 0 1 0 1 0 1
4 4 0 0 1 0 1 0
2) dplyr/purrr This could alternately be written as the following which is close to the code in the question but generalizes to all required columns. Note that here we make model data frames using make_md
rather than making model matrices with make_mm
. Also note that the dot in group_by(m, ID = .$ID)
refers to d
and not to m
.
library(dplyr)
library(purrr)
make_md <- function(nm, data) {
data %>%
select(nm) %>%
model.matrix(~ . - 1, .) %>%
as.data.frame
}
d %>% {
m <- map_dfc(names(.)[-1], make_md, .)
group_by(m, ID = .$ID) %>%
summarize_all(max) %>%
ungroup
}
Upvotes: 1
Reputation: 887158
We can do this in base R
using table
by replicating the 'ID' column by the number of columns of dataset minus 1, and paste
ing the column names with the unlist
ed values (excluding the 'ID' column)
table(rep(d$ID, 2), paste0(names(d)[-1][col(d[-1])], unlist(d[-1])))
# AreaEast AreaNorth AreaSouth QualityBad QualityGood QualityUgly
# 1 1 1 0 1 1 0
# 2 0 1 0 0 0 1
# 3 1 0 1 0 2 0
# 4 0 1 0 0 0 1
or with tidyverse
, gather
into 'long' format, unite
the 'key', 'val' columns to a single column, get the distinct
rows, and spread
into 'wide' format after creating a column of 1s.
library(tidyverse)
gather(d, key, val, -ID) %>%
unite(kv, key, val) %>%
distinct %>%
mutate(n = 1) %>%
spread(kv, n, fill = 0)
#ID Area_East Area_North Area_South Quality_Bad Quality_Good Quality_Ugly
#1 1 1 1 0 1 1 0
#2 2 0 1 0 0 0 1
#3 3 1 0 1 0 1 0
#4 4 0 1 0 0 0 1
Upvotes: 1