amisos55
amisos55

Reputation: 1979

Extracting time information from a raw dataset in R

I would like to extract ID, item information and time information from an unstructured dataset. Here is my sample dataset looks like:

df <- data.frame(Text_1 = c("Scoring", "1 = Incorrect","Text1","Text2","Text3","Text4", "Demo 1: Color Naming","Amarillo","Azul","Verde","Azul",
                            "Demo 1: Errors","Item 1: Color naming","Amarillo","Azul","Verde","Azul",
                            "Item 1: Time in seconds","Item 1: Errors",
                            "Item 2: Shape Naming","Cuadrado/Cuadro","Cuadrado/Cuadro","Círculo","Estrella","Círculo","Triángulo",
                            "Item 2: Time in seconds","Item 2: Errors"),
                  School.2 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, NA,NA,NA,NA,NA,
                             0,"1 = Incorrect responses",0,1,NA,NA,NA,0,"1 = Incorrect responses",0,NA,NA,1,1,0,NA,0),
                 X_Elementary_School..3 = c("Bill:","X District","10/7/21","K","123-2222-2:",NA, NA,NA,NA,NA,NA,
                               NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,NA,NA),
                 School.4 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, 0,NA,1,NA,NA,0,"1 = Incorrect responses",0,1,NA,NA,120,0,"1 = Incorrect responses",NA,1,0,1,NA,1,110,0),
                 Y_Elementary_School..2 = c("John:","X District","11/7/21","K","112-1111-3:",NA, NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA, NA,NA))


> df
                    Text_1                School.2 X_Elementary_School..3                School.4 Y_Elementary_School..2
1                  Scoring                Teacher:                  Bill:                Teacher:                  John:
2            1 = Incorrect                DC Name:             X District                DC Name:             X District
3                    Text1      Date (mm/dd/yyyy):                10/7/21      Date (mm/dd/yyyy):                11/7/21
4                    Text2            Child Grade:                      K            Child Grade:                      K
5                    Text3       Student Study ID:            123-2222-2:       Student Study ID:            112-1111-3:
6                    Text4                    <NA>                   <NA>                    <NA>                   <NA>
7     Demo 1: Color Naming                    <NA>                   <NA>                       0                   <NA>
8                 Amarillo                    <NA>                   <NA>                    <NA>                   <NA>
9                     Azul                    <NA>                   <NA>                       1                   <NA>
10                   Verde                    <NA>                   <NA>                    <NA>                   <NA>
11                    Azul                    <NA>                   <NA>                    <NA>                   <NA>
12          Demo 1: Errors                       0                   <NA>                       0                   <NA>
13    Item 1: Color naming 1 = Incorrect responses         Child response 1 = Incorrect responses         Child response
14                Amarillo                       0                   <NA>                       0                   <NA>
15                    Azul                       1                   <NA>                       1                   <NA>
16                   Verde                    <NA>                   <NA>                    <NA>                   <NA>
17                    Azul                    <NA>                   <NA>                    <NA>                   <NA>
18 Item 1: Time in seconds                    <NA>                   <NA>                     120                   <NA>
19          Item 1: Errors                       0                   <NA>                       0                   <NA>
20    Item 2: Shape Naming 1 = Incorrect responses         Child response 1 = Incorrect responses         Child response
21         Cuadrado/Cuadro                       0                   <NA>                    <NA>                   <NA>
22         Cuadrado/Cuadro                    <NA>                   <NA>                       1                   <NA>
23                 Círculo                    <NA>                   <NA>                       0                   <NA>
24                Estrella                       1                   <NA>                       1                   <NA>
25                 Círculo                       1                   <NA>                    <NA>                   <NA>
26               Triángulo                       0                   <NA>                       1                   <NA>
27 Item 2: Time in seconds                    <NA>                   <NA>                     110                   <NA>
28          Item 2: Errors                       0                   <NA>                       0                   <NA>

Here, When the first column has encounters Item #: Time in seconds, I would like to keep corresponding value under School.2 and School.4 columns. So the first students has those Item 1 and Item 2 time information as empty cell so they are NA. The second student has those time information as 120 and 110. These two students have two items in the example dataset.

There are multiple columns for the real dataset so I need something to be in a generalized loop.

My desired output would be:

> time
          id itemid time
1 123-2222-2 Item 1   NA
2 123-2222-2 Item 2   NA
3 112-1111-3 Item 1  120
4 112-1111-3 Item 2  110

This is my attempt but I could not add the id yet.

time.data <- df %>%
  filter(str_detect(Text_1, 'Time in seconds')) # %>%
# select(time = 4)

select_time_cols <- seq(from = 2, to = ncol(time.data), by = 2)

time <- time.data %>%
  select(time = select_time_cols)

time.t<-as.data.frame(t(time))
rownames(time.t)<-seq(1,nrow(time.t),1)
colnames(time.t)<-paste0('i',seq(1,ncol(time.t),1))
time.t<-apply(time.t,2,as.numeric)
time.t<-as.data.frame(time.t)

> time.t
  item1 item2
1    NA    NA
2   120   110

Upvotes: 1

Views: 45

Answers (1)

Silentdevildoll
Silentdevildoll

Reputation: 1280

I came up with a solution with your test dataset, though this isn't particularly robust or elegant, so there will likely be better options out there. The key thing I did was shift the text3 row over, so that the id and time were in the same column. I put notes in the code to illuminate my steps. Hopefully this points you in the right direction or prompts those with better coding skills than my own!

library(dplyr)
library(tidyr)

df2<-df%>% #Turning all to character, so that I can move the row without interfering with factor level
  mutate_all(as.character)
df2[5, 2:(ncol(df2) - 1)] <- df2[5, 3:ncol(df2)] #shifting row 5, which is the "Text3" that has studentID to be the same column as the times

df3<-df2%>%
  filter(grepl("Text3|Time in seconds", Text_1))%>% #removing unnecessary rows
  mutate(type = case_when(grepl("Text", Text_1) ~ "itemid", #Relabling the Text_1 column
                          grepl("Item 1", Text_1) ~ "1",
                          grepl("Item 2", Text_1) ~ "2"))%>%
  select(grep("type|^School", names(.))) #only keeping needed columns

colnames(df3) <- df3[1,] #Taking the first row and making it the column names
df3 <- df3[-1, ] #removing row 1, since it was made into column names


df3%>%
  tidyr::pivot_longer(-itemid, names_to = "id", values_to = "time")%>% #Making the data into longer format
  select(id, itemid, time)%>% #relocating columns to match desired output
  arrange(desc(id)) #sorting to match desired output

Upvotes: 2

Related Questions