Reputation: 37
I received a dataset from a survey, and it has a rather strange structure. Every row has 90 variables, with the first 15 being general questions and the other 75 being 5 sets of 15 variables, that are actually observations. The concrete case is an employer being asked questions about employees. It holds the following structure:
set.seed(42)
n <- 100
data <- data.frame(
EmployerVar1=1:n,,
EmployerVar2=rnorm(n),
EmployerVar3=rnorm(n),
EmployerVar4=rnorm(n),
EmployerVar5=rnorm(n),
EmployerVar6=rnorm(n),
EmployerVar7=rnorm(n),
EmployerVar8=rnorm(n),
EmployerVar9=rnorm(n),
EmployerVar10=rnorm(n),
EmployerVar11=rnorm(n),
EmployerVar12=rnorm(n),
EmployerVar13=rnorm(n),
EmployerVar14=rnorm(n),
EmployerVar15=rnorm(n),
Employee1Var1=rnorm(n),
Employee1Var2=rnorm(n),
Employee1Var3=rnorm(n),
Employee1Var4=rnorm(n),
Employee1Var5=rnorm(n),
Employee1Var6=rnorm(n),
Employee1Var7=rnorm(n),
Employee1Var8=rnorm(n),
Employee1Var9=rnorm(n),
Employee1Var10=rnorm(n),
Employee1Var11=rnorm(n),
Employee1Var12=rnorm(n),
Employee1Var13=rnorm(n),
Employee1Var14=rnorm(n),
Employee1Var15=rnorm(n),
Employee2Var1=rnorm(n),
Employee2Var2=rnorm(n),
Employee2Var3=rnorm(n),
Employee2Var4=rnorm(n),
Employee2Var5=rnorm(n),
Employee2Var6=rnorm(n),
Employee2Var7=rnorm(n),
Employee2Var8=rnorm(n),
Employee2Var9=rnorm(n),
Employee2Var10=rnorm(n),
Employee2Var11=rnorm(n),
Employee2Var12=rnorm(n),
Employee2Var13=rnorm(n),
Employee2Var14=rnorm(n),
Employee2Var15=rnorm(n),
Employee3Var1=rnorm(n),
Employee3Var2=rnorm(n),
Employee3Var3=rnorm(n),
Employee3Var4=rnorm(n),
Employee3Var5=rnorm(n),
Employee3Var6=rnorm(n),
Employee3Var7=rnorm(n),
Employee3Var8=rnorm(n),
Employee3Var9=rnorm(n),
Employee3Var10=rnorm(n),
Employee3Var11=rnorm(n),
Employee3Var12=rnorm(n),
Employee3Var13=rnorm(n),
Employee3Var14=rnorm(n),
Employee3Var15=rnorm(n),
Employee4Var1=rnorm(n),
Employee4Var2=rnorm(n),
Employee4Var3=rnorm(n),
Employee4Var4=rnorm(n),
Employee4Var5=rnorm(n),
Employee4Var6=rnorm(n),
Employee4Var7=rnorm(n),
Employee4Var8=rnorm(n),
Employee4Var9=rnorm(n),
Employee4Var10=rnorm(n),
Employee4Var11=rnorm(n),
Employee4Var12=rnorm(n),
Employee4Var13=rnorm(n),
Employee4Var14=rnorm(n),
Employee4Var15=rnorm(n),
Employee5Var1=rnorm(n),
Employee5Var2=rnorm(n),
Employee5Var3=rnorm(n),
Employee5Var4=rnorm(n),
Employee5Var5=rnorm(n),
Employee5Var6=rnorm(n),
Employee5Var7=rnorm(n),
Employee5Var8=rnorm(n),
Employee5Var9=rnorm(n),
Employee5Var10=rnorm(n),
Employee5Var11=rnorm(n),
Employee5Var12=rnorm(n),
Employee5Var13=rnorm(n),
Employee5Var14=rnorm(n),
Employee5Var15=rnorm(n))
For analysis the dataset needs to have every observation in a new row, being the employer characteristics with the 15 variables linked to one employee. So the number of rows needs to increase fivefold.
I asked this question once before and while I thought this was the solution, it eventually didn't work. The suggestion was to:
library(tidyr)
X_wide <- data.frame(id = 1:3, P1 = 4:6, P2 = 7:9, P3 = 10:12)
X_long <- pivot_longer(X_wide, cols = P1:P3, names_to = "person", values_to = "score")
X_long <- as.data.frame(X_long)
This does not work as it would collapse all observation-columns in one new variable. All the data needs to remain the same, it only needs to land in a new spot. To clarify, this is the structure that I would need for analysis.
Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
---|---|---|---|---|---|
EmployerVar1 | EmployerVar2 | EmployerVar3 | Employee1Var1 | Employee1Var2 | Employee1Var3 |
EmployerVar1 | EmployerVar2 | EmployerVar3 | Employee2Var1 | Employee2Var2 | Employee2Var3 |
EmployerVar1 | EmployerVar2 | EmployerVar3 | Employee3Var1 | Employee3Var2 | Employee3Var3 |
Upvotes: 1
Views: 495
Reputation: 2491
You're trying to reshape the data slightly longer.
This is not a simple pivot_longer()
because:
Employee
. You can fix this by using starts_with()
.Employee
) needs to create new rows, but the second part (Question
) needs to stay in separate columns. For this, you can use separate()
to split the two parts, and pivot_wider()
.To show you a possible solution, I recreated a minimum reproducible dataset from scratch.
The problem can be simplified to a minimum dataset of:
The sample data you provided is different to your actual data: you used rnorm()
to simulate all columns, but you said in the comments that your dataset actually contains mostly categorical data. This is relevant because having different data types will complicate the pivotting. I made an assumption that your dataset contains only categorical or numeric columns.
I also slightly changed Var
into .Qst
so that it was clearer to me what they meant, and it was easier to split the employee part from the question number part.
library(tidyr)
library(dplyr)
dat <- data.frame(
EmployerVar1 = c("a", "b"),
EmployerVar2 = c("d", "e"),
Employee1.Qst1 = c("A", "B"),
Employee1.Qst2 = c(1.5, 2.5),
Employee2.Qst1 = c("F", "G"),
Employee2.Qst2 = c(-6,-7)
)
dat |>
# Change all numerical columns to character
mutate(across(where(is.double), as.character)) |>
# Pivot longer
pivot_longer(cols = starts_with("Employee"),
names_to = "Employee.Qst") |>
# Split
separate(col = Employee.Qst, c("Employee", "Question")) |>
# Pivot wider
pivot_wider(names_from = Question,
values_from = value)
PS I really recommend going through the pivoting vignette from tidyr
, it's really helpful.
Upvotes: 1