Reputation: 984
I have a really messy dataset where each column (correctly) corresponds to the variable of interest.
The dataset essentially counts individuals. For example, Var1 should be a list of people for whom Var1 is true. Imagine that the variables are flavors of ice cream. Var1 is chocolate ice cream. The way the data was recorded is such that instead of indicating whether or not (1/0 or T/F) somebody like chocolate ice cream, the dataset just contains the names of the people who like chocolate ice cream.
This list-style dataset makes it difficult to analyze the data because the rows do not correspond to individual observations. Right now, each column just contains a list of names. E.g. Var1 might be a list (not in the R sense but in the real world sense) of names of people who like chocolate ice cream.
To make this dataset amenable to analysis, I want to use the information to make every row in the dataset correspond to an observation and every cell value correspond to whether or not the observation is T/F for the given variable.
Right now, the dataset looks something like this:
Var1 Var2 Var3
Name1 Name1 Name2
Name2 Name3
Name4 Name4
or in terms of ice cream flavors:
Chocolate Strawberry Raspberry
Barbara Barbara Shanshan
Shanshan Maria
Louis Louis
So Barbara likes chocolate and strawberry ice cream but the dataset is messy in such a way that Shanshan's name is in the same row as Barbara's name. That shouldn't be the case. The first row should represent Barbara's values and the cell values should either be 1/0 or T/F indicating whether or not Barbara likes a particular flavor of ice cream.
In short, I would like it to look as follows
Var1 Var2 Var3
1 1 0
1 0 1
0 1 0
1 1 0
Upvotes: 1
Views: 109
Reputation: 35604
library(tidyr)
df %>%
pivot_longer(everything()) %>%
drop_na(value) %>%
pivot_wider(values_from = name,
values_fill = list(name = 0),
values_fn = list(name = ~1))
# # A tibble: 4 x 4
# value Var1 Var2 Var3
# <chr> <dbl> <dbl> <dbl>
# 1 Name1 1 1 0
# 2 Name2 1 0 1
# 3 Name3 0 1 0
# 4 Name4 1 1 0
Description
values_fn = list(name = ~ 1)
: convert strings to 1
values_fill = list(name = 0)
: specify 0 to be filled when missing
Data
df <- structure(list(
Var1 = c("Name1", "Name2", "Name4"),
Var2 = c("Name1", "Name3", "Name4"),
Var3 = c("Name2", NA, NA)
), row.names = c(NA, -3L), class = "data.frame")
Upvotes: 1
Reputation: 19394
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything()) %>%
filter(!is.na(value)) %>%
pivot_wider(id_cols=value, values_from=name) %>%
mutate_at(.vars=vars(-value),
.funs=~ifelse(is.na(.), 0, 1))
# A tibble: 4 x 4
value Var1 Var2 Var3
<fct> <dbl> <dbl> <dbl>
1 Name1 1 1 0
2 Name2 1 0 1
3 Name3 0 1 0
4 Name4 1 1 0
Data
library(tibble)
df <- tribble(~Var1, ~Var2, ~Var3,
'Name1', 'Name1', 'Name2',
'Name2', 'Name3', NA,
'Name4', 'Name4', NA)
Upvotes: 1