Vinnie De Craim
Vinnie De Craim

Reputation: 37

How can I make new observations from columns in R?

I received a dataset from a survey, and it has a rather strange structure. Every row has 90 variables, with the first 15 being general questions and the other 75 being 5 sets of 15 variables, that are actually observations. The concrete case is an employer being asked questions about employees. It holds the following structure:

set.seed(42) 
n <- 100
data <- data.frame(
                  EmployerVar1=1:n,,
                  EmployerVar2=rnorm(n),
                  EmployerVar3=rnorm(n),
                  EmployerVar4=rnorm(n),
                  EmployerVar5=rnorm(n),
                  EmployerVar6=rnorm(n),
                  EmployerVar7=rnorm(n),
                  EmployerVar8=rnorm(n),
                  EmployerVar9=rnorm(n),
                  EmployerVar10=rnorm(n),
                  EmployerVar11=rnorm(n),
                  EmployerVar12=rnorm(n),
                  EmployerVar13=rnorm(n),
                  EmployerVar14=rnorm(n),
                  EmployerVar15=rnorm(n),
                  Employee1Var1=rnorm(n),
                  Employee1Var2=rnorm(n),
                  Employee1Var3=rnorm(n),
                  Employee1Var4=rnorm(n),
                  Employee1Var5=rnorm(n),
                  Employee1Var6=rnorm(n),
                  Employee1Var7=rnorm(n),
                  Employee1Var8=rnorm(n),
                  Employee1Var9=rnorm(n),
                  Employee1Var10=rnorm(n),
                  Employee1Var11=rnorm(n),
                  Employee1Var12=rnorm(n),
                  Employee1Var13=rnorm(n),
                  Employee1Var14=rnorm(n),
                  Employee1Var15=rnorm(n),
                  Employee2Var1=rnorm(n),
                  Employee2Var2=rnorm(n),
                  Employee2Var3=rnorm(n),
                  Employee2Var4=rnorm(n),
                  Employee2Var5=rnorm(n),
                  Employee2Var6=rnorm(n),
                  Employee2Var7=rnorm(n),
                  Employee2Var8=rnorm(n),
                  Employee2Var9=rnorm(n),
                  Employee2Var10=rnorm(n),
                  Employee2Var11=rnorm(n),
                  Employee2Var12=rnorm(n),
                  Employee2Var13=rnorm(n),
                  Employee2Var14=rnorm(n),
                  Employee2Var15=rnorm(n),
                  Employee3Var1=rnorm(n),
                  Employee3Var2=rnorm(n),
                  Employee3Var3=rnorm(n),
                  Employee3Var4=rnorm(n),
                  Employee3Var5=rnorm(n),
                  Employee3Var6=rnorm(n),
                  Employee3Var7=rnorm(n),
                  Employee3Var8=rnorm(n),
                  Employee3Var9=rnorm(n),
                  Employee3Var10=rnorm(n),
                  Employee3Var11=rnorm(n),
                  Employee3Var12=rnorm(n),
                  Employee3Var13=rnorm(n),
                  Employee3Var14=rnorm(n),
                  Employee3Var15=rnorm(n),
                  Employee4Var1=rnorm(n),
                  Employee4Var2=rnorm(n),
                  Employee4Var3=rnorm(n),
                  Employee4Var4=rnorm(n),
                  Employee4Var5=rnorm(n),
                  Employee4Var6=rnorm(n),
                  Employee4Var7=rnorm(n),
                  Employee4Var8=rnorm(n),
                  Employee4Var9=rnorm(n),
                  Employee4Var10=rnorm(n),
                  Employee4Var11=rnorm(n),
                  Employee4Var12=rnorm(n),
                  Employee4Var13=rnorm(n),
                  Employee4Var14=rnorm(n),
                  Employee4Var15=rnorm(n),
                  Employee5Var1=rnorm(n),
                  Employee5Var2=rnorm(n),
                  Employee5Var3=rnorm(n),
                  Employee5Var4=rnorm(n),
                  Employee5Var5=rnorm(n),
                  Employee5Var6=rnorm(n),
                  Employee5Var7=rnorm(n),
                  Employee5Var8=rnorm(n),
                  Employee5Var9=rnorm(n),
                  Employee5Var10=rnorm(n),
                  Employee5Var11=rnorm(n),
                  Employee5Var12=rnorm(n),
                  Employee5Var13=rnorm(n),
                  Employee5Var14=rnorm(n),
                  Employee5Var15=rnorm(n))

For analysis the dataset needs to have every observation in a new row, being the employer characteristics with the 15 variables linked to one employee. So the number of rows needs to increase fivefold.

I asked this question once before and while I thought this was the solution, it eventually didn't work. The suggestion was to:

library(tidyr)

X_wide <- data.frame(id = 1:3, P1 = 4:6, P2 = 7:9, P3 = 10:12)
X_long <- pivot_longer(X_wide, cols = P1:P3, names_to = "person", values_to = "score")
X_long <- as.data.frame(X_long)

This does not work as it would collapse all observation-columns in one new variable. All the data needs to remain the same, it only needs to land in a new spot. To clarify, this is the structure that I would need for analysis.

Col1 Col2 Col3 Col4 Col5 Col6
EmployerVar1 EmployerVar2 EmployerVar3 Employee1Var1 Employee1Var2 Employee1Var3
EmployerVar1 EmployerVar2 EmployerVar3 Employee2Var1 Employee2Var2 Employee2Var3
EmployerVar1 EmployerVar2 EmployerVar3 Employee3Var1 Employee3Var2 Employee3Var3

Upvotes: 1

Views: 495

Answers (1)

Andrea M
Andrea M

Reputation: 2491

You're trying to reshape the data slightly longer.

This is not a simple pivot_longer() because:

  1. There are some columns that you don't need to reshape - the only columns that need reshaping start with Employee. You can fix this by using starts_with().
  2. The columns that need reshaping contain 2 parts: the first part (Employee) needs to create new rows, but the second part (Question) needs to stay in separate columns. For this, you can use separate() to split the two parts, and pivot_wider().

To show you a possible solution, I recreated a minimum reproducible dataset from scratch.

The problem can be simplified to a minimum dataset of:

  • two rows (one for employer)
  • two columns that will stay the same
  • four columns that will need reshaping (2 per employee and 2 per question)

The sample data you provided is different to your actual data: you used rnorm() to simulate all columns, but you said in the comments that your dataset actually contains mostly categorical data. This is relevant because having different data types will complicate the pivotting. I made an assumption that your dataset contains only categorical or numeric columns.

I also slightly changed Var into .Qst so that it was clearer to me what they meant, and it was easier to split the employee part from the question number part.

library(tidyr)
library(dplyr)

dat <- data.frame(
  EmployerVar1 = c("a", "b"),
  EmployerVar2 = c("d", "e"),
  Employee1.Qst1 = c("A", "B"),
  Employee1.Qst2 = c(1.5, 2.5),
  Employee2.Qst1 = c("F", "G"),
  Employee2.Qst2 = c(-6,-7)
)

dat |> 
  # Change all numerical columns to character
  mutate(across(where(is.double), as.character)) |> 
  # Pivot longer
  pivot_longer(cols = starts_with("Employee"),
               names_to = "Employee.Qst") |>
  # Split 
  separate(col = Employee.Qst, c("Employee", "Question")) |> 
  # Pivot wider
  pivot_wider(names_from = Question, 
              values_from = value)

PS I really recommend going through the pivoting vignette from tidyr, it's really helpful.

Upvotes: 1

Related Questions