Tea Tree
Tea Tree

Reputation: 984

How to tidy dataset where columns are dummy variables and cell values are names of observations?

I have a really messy dataset where each column (correctly) corresponds to the variable of interest.

The dataset essentially counts individuals. For example, Var1 should be a list of people for whom Var1 is true. Imagine that the variables are flavors of ice cream. Var1 is chocolate ice cream. The way the data was recorded is such that instead of indicating whether or not (1/0 or T/F) somebody like chocolate ice cream, the dataset just contains the names of the people who like chocolate ice cream.

This list-style dataset makes it difficult to analyze the data because the rows do not correspond to individual observations. Right now, each column just contains a list of names. E.g. Var1 might be a list (not in the R sense but in the real world sense) of names of people who like chocolate ice cream.

To make this dataset amenable to analysis, I want to use the information to make every row in the dataset correspond to an observation and every cell value correspond to whether or not the observation is T/F for the given variable.

Right now, the dataset looks something like this:

Var1   Var2   Var3
Name1  Name1  Name2
Name2  Name3
Name4  Name4

or in terms of ice cream flavors:

Chocolate     Strawberry     Raspberry
Barbara       Barbara        Shanshan
Shanshan      Maria
Louis         Louis

So Barbara likes chocolate and strawberry ice cream but the dataset is messy in such a way that Shanshan's name is in the same row as Barbara's name. That shouldn't be the case. The first row should represent Barbara's values and the cell values should either be 1/0 or T/F indicating whether or not Barbara likes a particular flavor of ice cream.

In short, I would like it to look as follows

Var1   Var2   Var3
1      1      0
1      0      1
0      1      0
1      1      0

Upvotes: 1

Views: 109

Answers (2)

Darren Tsai
Darren Tsai

Reputation: 35604

library(tidyr)

df %>%
  pivot_longer(everything()) %>%
  drop_na(value) %>%
  pivot_wider(values_from = name,
              values_fill = list(name = 0),
              values_fn = list(name = ~1))

# # A tibble: 4 x 4
#   value  Var1  Var2  Var3
#   <chr> <dbl> <dbl> <dbl>
# 1 Name1     1     1     0
# 2 Name2     1     0     1
# 3 Name3     0     1     0
# 4 Name4     1     1     0

Description

  • values_fn = list(name = ~ 1): convert strings to 1

  • values_fill = list(name = 0): specify 0 to be filled when missing

Data

df <- structure(list(
  Var1 = c("Name1", "Name2", "Name4"),
  Var2 = c("Name1", "Name3", "Name4"),
  Var3 = c("Name2", NA, NA)
 ), row.names = c(NA, -3L), class = "data.frame")

Upvotes: 1

Edward
Edward

Reputation: 19394

library(dplyr)
library(tidyr)
df %>% 
  pivot_longer(everything()) %>%
  filter(!is.na(value)) %>%
  pivot_wider(id_cols=value, values_from=name) %>%
  mutate_at(.vars=vars(-value), 
            .funs=~ifelse(is.na(.), 0, 1))

# A tibble: 4 x 4
  value  Var1  Var2  Var3
  <fct> <dbl> <dbl> <dbl>
1 Name1     1     1     0
2 Name2     1     0     1
3 Name3     0     1     0
4 Name4     1     1     0

Data


library(tibble)

df <- tribble(~Var1,   ~Var2,   ~Var3,
'Name1',  'Name1',  'Name2',
'Name2',  'Name3', NA,
'Name4',  'Name4', NA)

Upvotes: 1

Related Questions