How can I make new observations from columns in R?

Question

I received a dataset from a survey, and it has a rather strange structure. Every row has 90 variables, with the first 15 being general questions and the other 75 being 5 sets of 15 variables, that are actually observations. The concrete case is an employer being asked questions about employees. It holds the following structure:

set.seed(42) 
n <- 100
data <- data.frame(
                  EmployerVar1=1:n,,
                  EmployerVar2=rnorm(n),
                  EmployerVar3=rnorm(n),
                  EmployerVar4=rnorm(n),
                  EmployerVar5=rnorm(n),
                  EmployerVar6=rnorm(n),
                  EmployerVar7=rnorm(n),
                  EmployerVar8=rnorm(n),
                  EmployerVar9=rnorm(n),
                  EmployerVar10=rnorm(n),
                  EmployerVar11=rnorm(n),
                  EmployerVar12=rnorm(n),
                  EmployerVar13=rnorm(n),
                  EmployerVar14=rnorm(n),
                  EmployerVar15=rnorm(n),
                  Employee1Var1=rnorm(n),
                  Employee1Var2=rnorm(n),
                  Employee1Var3=rnorm(n),
                  Employee1Var4=rnorm(n),
                  Employee1Var5=rnorm(n),
                  Employee1Var6=rnorm(n),
                  Employee1Var7=rnorm(n),
                  Employee1Var8=rnorm(n),
                  Employee1Var9=rnorm(n),
                  Employee1Var10=rnorm(n),
                  Employee1Var11=rnorm(n),
                  Employee1Var12=rnorm(n),
                  Employee1Var13=rnorm(n),
                  Employee1Var14=rnorm(n),
                  Employee1Var15=rnorm(n),
                  Employee2Var1=rnorm(n),
                  Employee2Var2=rnorm(n),
                  Employee2Var3=rnorm(n),
                  Employee2Var4=rnorm(n),
                  Employee2Var5=rnorm(n),
                  Employee2Var6=rnorm(n),
                  Employee2Var7=rnorm(n),
                  Employee2Var8=rnorm(n),
                  Employee2Var9=rnorm(n),
                  Employee2Var10=rnorm(n),
                  Employee2Var11=rnorm(n),
                  Employee2Var12=rnorm(n),
                  Employee2Var13=rnorm(n),
                  Employee2Var14=rnorm(n),
                  Employee2Var15=rnorm(n),
                  Employee3Var1=rnorm(n),
                  Employee3Var2=rnorm(n),
                  Employee3Var3=rnorm(n),
                  Employee3Var4=rnorm(n),
                  Employee3Var5=rnorm(n),
                  Employee3Var6=rnorm(n),
                  Employee3Var7=rnorm(n),
                  Employee3Var8=rnorm(n),
                  Employee3Var9=rnorm(n),
                  Employee3Var10=rnorm(n),
                  Employee3Var11=rnorm(n),
                  Employee3Var12=rnorm(n),
                  Employee3Var13=rnorm(n),
                  Employee3Var14=rnorm(n),
                  Employee3Var15=rnorm(n),
                  Employee4Var1=rnorm(n),
                  Employee4Var2=rnorm(n),
                  Employee4Var3=rnorm(n),
                  Employee4Var4=rnorm(n),
                  Employee4Var5=rnorm(n),
                  Employee4Var6=rnorm(n),
                  Employee4Var7=rnorm(n),
                  Employee4Var8=rnorm(n),
                  Employee4Var9=rnorm(n),
                  Employee4Var10=rnorm(n),
                  Employee4Var11=rnorm(n),
                  Employee4Var12=rnorm(n),
                  Employee4Var13=rnorm(n),
                  Employee4Var14=rnorm(n),
                  Employee4Var15=rnorm(n),
                  Employee5Var1=rnorm(n),
                  Employee5Var2=rnorm(n),
                  Employee5Var3=rnorm(n),
                  Employee5Var4=rnorm(n),
                  Employee5Var5=rnorm(n),
                  Employee5Var6=rnorm(n),
                  Employee5Var7=rnorm(n),
                  Employee5Var8=rnorm(n),
                  Employee5Var9=rnorm(n),
                  Employee5Var10=rnorm(n),
                  Employee5Var11=rnorm(n),
                  Employee5Var12=rnorm(n),
                  Employee5Var13=rnorm(n),
                  Employee5Var14=rnorm(n),
                  Employee5Var15=rnorm(n))

For analysis the dataset needs to have every observation in a new row, being the employer characteristics with the 15 variables linked to one employee. So the number of rows needs to increase fivefold.

I asked this question once before and while I thought this was the solution, it eventually didn't work. The suggestion was to:

library(tidyr)

X_wide <- data.frame(id = 1:3, P1 = 4:6, P2 = 7:9, P3 = 10:12)
X_long <- pivot_longer(X_wide, cols = P1:P3, names_to = "person", values_to = "score")
X_long <- as.data.frame(X_long)

This does not work as it would collapse all observation-columns in one new variable. All the data needs to remain the same, it only needs to land in a new spot. To clarify, this is the structure that I would need for analysis.

Col1	Col2	Col3	Col4	Col5	Col6
EmployerVar1	EmployerVar2	EmployerVar3	Employee1Var1	Employee1Var2	Employee1Var3
EmployerVar1	EmployerVar2	EmployerVar3	Employee2Var1	Employee2Var2	Employee2Var3
EmployerVar1	EmployerVar2	EmployerVar3	Employee3Var1	Employee3Var2	Employee3Var3

Andrea M · Accepted Answer

You're trying to reshape the data slightly longer.

This is not a simple pivot_longer() because:

There are some columns that you don't need to reshape - the only columns that need reshaping start with Employee. You can fix this by using starts_with().
The columns that need reshaping contain 2 parts: the first part (Employee) needs to create new rows, but the second part (Question) needs to stay in separate columns. For this, you can use separate() to split the two parts, and pivot_wider().

To show you a possible solution, I recreated a minimum reproducible dataset from scratch.

The problem can be simplified to a minimum dataset of:

two rows (one for employer)
two columns that will stay the same
four columns that will need reshaping (2 per employee and 2 per question)

The sample data you provided is different to your actual data: you used rnorm() to simulate all columns, but you said in the comments that your dataset actually contains mostly categorical data. This is relevant because having different data types will complicate the pivotting. I made an assumption that your dataset contains only categorical or numeric columns.

I also slightly changed Var into .Qst so that it was clearer to me what they meant, and it was easier to split the employee part from the question number part.

library(tidyr)
library(dplyr)

dat <- data.frame(
  EmployerVar1 = c("a", "b"),
  EmployerVar2 = c("d", "e"),
  Employee1.Qst1 = c("A", "B"),
  Employee1.Qst2 = c(1.5, 2.5),
  Employee2.Qst1 = c("F", "G"),
  Employee2.Qst2 = c(-6,-7)
)

dat |> 
  # Change all numerical columns to character
  mutate(across(where(is.double), as.character)) |> 
  # Pivot longer
  pivot_longer(cols = starts_with("Employee"),
               names_to = "Employee.Qst") |>
  # Split 
  separate(col = Employee.Qst, c("Employee", "Question")) |> 
  # Pivot wider
  pivot_wider(names_from = Question, 
              values_from = value)

PS I really recommend going through the pivoting vignette from tidyr, it's really helpful.

How can I make new observations from columns in R?

Answers (1)

Related Questions