Reputation: 937
I am relatively new to R. My question isn't entirely as straightforward as the title. This is a sample of what df
looks like:
id amenities
1 wireless internet, air conditioning, pool, kitchen
2 pool, kitchen, washer, dryer
3 wireless internet, kitchen, dryer
4
5 wireless internet
this is what i want df
to look like:
id wireless internet air conditioning pool kitchen washer dryer
1 1 1 1 1 0 0
2 0 0 1 1 1 1
3 1 0 0 1 0 1
4 0 0 0 0 0 0
5 1 0 0 0 0 0
sample code to reproduce data
df <- data.frame(id = c(1, 2, 3, 4, 5),
amenities = c("wireless internet, air conditioning, pool, kitchen",
"pool, kitchen, washer, dryer",
"wireless internet, kitchen, dryer",
"",
"wireless internet"),
stringsAsFactors = FALSE)
Upvotes: 3
Views: 8097
Reputation: 42544
For the sake of completeness, here is also a concise data.table
solution:
library(data.table)
setDT(df)[, strsplit(amenities, ", "), by = id][
, dcast(.SD, id ~ V1, length)]
id air conditioning dryer kitchen pool washer wireless internet 1: 1 1 0 1 1 0 1 2: 2 0 1 1 1 1 0 3: 3 0 1 1 0 0 1 4: 5 0 0 0 0 0 1
After coercion to data.table, amenities
is split by ", "
into separate rows for each item (long format). This is then reshaped into wide format, using the length()
function to aggregate.
Upvotes: 0
Reputation: 6768
The dummies
package can be useful here. Try
library(dplyr); library(tidyr); library(dummies)
df2 <- df %>% separate_rows(amenities, sep = ",")
df2$amenities <- trimws(df2$amenities, "both") # remove spaces (left and right) - so that you will not have 2 "pool" columns in your final data frame
df2 <- dummy.data.frame(df2)[, -2]
colnames(df2) <- trimws(gsub("amenities", "", colnames(df2)), "both") # arrange colnames
df3 <- df2 %>%
group_by(id) %>%
summarise_all(funs(sum)) ## aggregate by column and id
df3
# A tibble: 5 x 7
#id `air conditioning` dryer kitchen pool washer `wireless internet`
#<dbl> <int> <int> <int> <int> <int> <int>
# 1 1 0 1 1 0 1
# 2 0 1 1 1 1 0
# 3 0 1 1 0 0 1
# 4 0 0 0 0 0 0
# 5 0 0 0 0 0 1
Upvotes: 0
Reputation: 174813
FWIW, here's a base R approach (assuming that df
contains your data as shown in the question)
dat <- with(df, strsplit(amenities, ', '))
df2 <- data.frame(id = factor(rep(df$id, times = lengths(dat)),
levels = df$id),
amenities = unlist(dat))
df3 <- as.data.frame(cbind(id = df$id,
table(df2$id, df2$amenities)))
This results in
> df3
id air conditioning dryer kitchen pool washer wireless internet
1 1 1 0 1 1 0 1
2 2 0 1 1 1 1 0
3 3 0 1 1 0 0 1
4 4 0 0 0 0 0 0
5 5 0 0 0 0 0 1
Breaking down what is going on:
dat <- with(df, strsplit(amenities, ', '))
splits the amenities
variable on ', '
, resulting in
> dat
[[1]]
[1] "wireless internet" "air conditioning" "pool"
[4] "kitchen"
[[2]]
[1] "pool" "kitchen" "washer" "dryer"
[[3]]
[1] "wireless internet" "kitchen" "dryer"
[[4]]
character(0)
[[5]]
[1] "wireless internet"
The second line takes dat
and turns it into a vector, and we add on and id
column by repeating the origina id
values as many times as the number of amenities for that id
. This results in
> df2
id amenities
1 1 wireless internet
2 1 air conditioning
3 1 pool
4 1 kitchen
5 2 pool
6 2 kitchen
7 2 washer
8 2 dryer
9 3 wireless internet
10 3 kitchen
11 3 dryer
12 5 wireless internet
Use the table()
function to create the contingency table and then we add on an id
column.
Upvotes: 6
Reputation: 39154
A solution using dplyr
and tidyr
. Notice that I replace ""
with None
because it is easier to process the column names later.
library(dplyr)
library(tidyr)
df2 <- df %>%
separate_rows(amenities, sep = ",") %>%
mutate(amenities = ifelse(amenities %in% "", "None", amenities)) %>%
mutate(value = 1) %>%
spread(amenities, value , fill = 0) %>%
select(-None)
df2
# id air conditioning dryer kitchen pool washer pool wireless internet
# 1 1 1 0 1 1 0 0 1
# 2 2 0 1 1 0 1 1 0
# 3 3 0 1 1 0 0 0 1
# 4 4 0 0 0 0 0 0 0
# 5 5 0 0 0 0 0 0 1
Upvotes: 3