gbravo
gbravo

Reputation: 55

Using Tidyr's pivot_longer_spec for parameters stored in multiple column names for one observation

Fairly new to R but from my research I believe this is a unique question.

I am working with water quality data that is provided to me by another agency in a wide format which looks like so:

library(tidyverse)
library(hablar)

wq.data<-as.data.frame(read.csv('C:/WQdata.csv',stringsAsFactors = FALSE))

wq.data<-wq.data%>%
  convert(chr(1:8))

wq.data
# A tibble: 9 x 8
  Station.ID SampleTime       AreaName   Sample.Depth.m Temp.Mid.degC Temp.Mid.Q DO.mg.l DO.Q 
  <chr>      <chr>            <chr>      <chr>          <chr>         <chr>      <chr>   <chr>
1 220        07/09/2019 14:44 Blue River 1.5            29.69         ""         1.66    ""   
2 220        06/10/2019 14:17 Blue River 2.1            28.67         K          3.62    ""   
3 220        05/22/2019 14:31 Blue River 1.7            29.63         ""         7.66    LP   
4 220        04/08/2019 14:15 Blue River 1.7            25.9          ""         6.27    ""   
5 220        03/13/2019 14:19 Blue River 1.4            23.04         K          5.51    ""   
6 220        02/13/2019 14:34 Blue River 2              20.17         ""         6.05    L    
7 220        01/15/2019 14:16 Blue River 1.7            18.75         MK         5.45    ""   
8 220        12/10/2018 14:53 Blue River 1.1            20.18         ""         5.64    ""   
9 220        11/14/2018 14:43 Blue River 1.1            27.43         ""         5.62    ""

I converted to chr to avoid issues of the data having no common type. You will see that besides the observation data included in Station.ID, SampleTime, and AreaName, the rest of the data includes both the parameter and units in the column name, which I want to extract. Most columns have an associated column for which the name always ends with Q (i.e. Temp.Mid.Q), which stands for Qualifier and is the QA/QC codes which match that specific observation and are important to keep with each observation.

This is only a subset, the actual data is much wider with many more parameters.

Ideally, I am looking for an end product like this:

  Station.ID       SampleTime   AreaName    Parameter Units Value Qualifier
1        220 07/09/2019 14:44 Blue River Sample.Depth     m  1.50          
2        220 07/09/2019 14:44 Blue River     Temp.Mid  degC 29.69          
3        220 07/09/2019 14:44 Blue River           DO  mg.l  1.66          
4        220 06/10/2019 14:17 Blue River Sample.Depth     m  2.10          
5        220 06/10/2019 14:17 Blue River     Temp.Mid  degC 28.67         K
6        220 06/10/2019 14:17 Blue River           DO  mg.l  3.62          
7        220 05/22/2019 14:31 Blue River Sample.Depth     m  1.70          
8        220 05/22/2019 14:31 Blue River     Temp.Mid  degC 29.63          
9        220 05/22/2019 14:31 Blue River           DO  mg.l  7.66        LP

I thought this question on stacking/melting multiple columns would help, but there the column names at least have a standard value and year. I am not familiar with regex syntax which may or may not be helpful here.

So, since I need to extract the parameters from my column names, I have resorted to attempting to do this "by hand" using an intermediary spec data frame as described near the end of the Tidyr Pivoting Vignette.

I began by creating the following spec:

spec<-tribble(
  ~.name,                           ~.value,  ~parameter,          ~units,
  "Sample.Depth.m",                 "value",  "Sample.Depth",      "m",
  "Temp.Mid.degC",                  "value",  "Temp.Mid",          "deg.C",
  "DO.mg.l",                        "value",  "DO",                "mg.l",
  "Temp.Mid.Q",                     "Qual",   NA,                  NA,
  "DO.Q",                           "Qual",   NA,                  NA,
)

However, perhaps obviously, when I execute:

wq.data.longer<-pivot_longer_spec(wq.data,spec)

I am returned with an extra row with NA inserted but the actual Qualifier character is absent from Qual and no longer associated with a specific observation:

wq.data.longer
# A tibble: 36 x 7
   Station.ID SampleTime       AreaName   parameter    units value Qual 
   <chr>      <chr>            <chr>      <chr>        <chr> <chr> <chr>
 1 220        07/09/2019 14:44 Blue River Sample.Depth m     1.5   NA   
 2 220        07/09/2019 14:44 Blue River Temp.Mid     deg.C 29.69 NA   
 3 220        07/09/2019 14:44 Blue River DO           mg.l  1.66  NA   
 4 220        07/09/2019 14:44 Blue River NA           NA    NA    ""   
 5 220        06/10/2019 14:17 Blue River Sample.Depth m     2.1   NA   
 6 220        06/10/2019 14:17 Blue River Temp.Mid     deg.C 28.67 NA   
 7 220        06/10/2019 14:17 Blue River DO           mg.l  3.62  NA   
 8 220        06/10/2019 14:17 Blue River NA           NA    NA    ""   
 9 220        05/22/2019 14:31 Blue River Sample.Depth m     1.7   NA   
10 220        05/22/2019 14:31 Blue River Temp.Mid     deg.C 29.63 NA   
# ... with 26 more rows

Recall my Qualifier would ideally show here in row 6 (for the above view) if all was correct:

  Station.ID SampleTime       AreaName   parameter    units value Qual  
6 220        06/10/2019 14:17 Blue River Temp.Mid     deg.C 28.67 K

This may be a fairly simple solution but I haven't been able to make any progress. Any assistance is very much appreciated. I haven't found evidence of many others using pivot_longer_spec nor much more guidance from the Tidyr vignette so I'm generally curious how to get farther with this method.

Many thanks!

Upvotes: 4

Views: 696

Answers (1)

Matt
Matt

Reputation: 2987

It seems like you just need to add the names for the parameter and units columns in your spec rather than NA

spec <- tribble(
  ~.name, ~ .value, ~ parameter, ~ units,
  "Sample.Depth.m", "value", "Sample.Depth", "m",
  "Temp.Mid.degC", "value", "Temp.Mid", "deg.C",
  "DO.mg.l", "value", "DO", "mg.l",
  "Temp.Mid.Q", "Qual", "Temp.Mid", "deg.C",
  "DO.Q", "Qual", "DO", "mg.1",
)

pivot_longer_spec(df, spec)

# A tibble: 36 x 7
#   Station.ID SampleTime AreaName   parameter    units value Qual 
#           <int> <fct>      <fct>      <chr>        <chr> <dbl> <chr>
# 1           220 7/9/2019   Blue River Sample.Depth m      1.5  NA   
# 2           220 7/9/2019   Blue River Temp.Mid     deg.C 29.7  ""   
# 3           220 7/9/2019   Blue River DO           mg.l   1.66 NA   
# 4           220 7/9/2019   Blue River DO           mg.1  NA    ""   
# 5           220 6/10/2019  Blue River Sample.Depth m      2.1  NA   
  6           220 6/10/2019  Blue River Temp.Mid     deg.C 28.7  K    
# 7           220 6/10/2019  Blue River DO           mg.l   3.62 NA   
# 8           220 6/10/2019  Blue River DO           mg.1  NA    ""   
# 9           220 5/22/2019  Blue River Sample.Depth m      1.7  NA   
# 10           220 5/22/2019  Blue River Temp.Mid     deg.C 29.6  ""   

Data

df <- structure(list(Station.ID = c(220L, 220L, 220L, 220L, 220L, 
220L, 220L, 220L, 220L), SampleTime = structure(c(9L, 8L, 7L, 
6L, 5L, 4L, 1L, 3L, 2L), .Label = c("1/15/2019", "11/14/2018", 
"12/10/2018", "2/13/2019", "3/13/2019", "4/8/2019", "5/22/2019", 
"6/10/2019", "7/9/2019"), class = "factor"), X = structure(c(8L, 
3L, 5L, 1L, 4L, 6L, 2L, 9L, 7L), .Label = c("14:15", "14:16", 
"14:17", "14:19", "14:31", "14:34", "14:43", "14:44", "14:53"
), class = "factor"), AreaName = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), .Label = "Blue River", class = "factor"), 
    Sample.Depth.m = c(1.5, 2.1, 1.7, 1.7, 1.4, 2, 1.7, 1.1, 
    1.1), Temp.Mid.degC = c(29.69, 28.67, 29.63, 25.9, 23.04, 
    20.17, 18.75, 20.18, 27.43), Temp.Mid.Q = structure(c(1L, 
    2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L), .Label = c("", "K", 
    "MK"), class = "factor"), DO.mg.l = c(1.66, 3.62, 7.66, 6.27, 
    5.51, 6.05, 5.45, 5.64, 5.62), DO.Q = c("", "", "LP", 
    "", "", "L", "", "", "")), row.names = c(NA, 
-9L), class = "data.frame")

Upvotes: 1

Related Questions