Reputation: 55
Fairly new to R but from my research I believe this is a unique question.
I am working with water quality data that is provided to me by another agency in a wide format which looks like so:
library(tidyverse)
library(hablar)
wq.data<-as.data.frame(read.csv('C:/WQdata.csv',stringsAsFactors = FALSE))
wq.data<-wq.data%>%
convert(chr(1:8))
wq.data
# A tibble: 9 x 8
Station.ID SampleTime AreaName Sample.Depth.m Temp.Mid.degC Temp.Mid.Q DO.mg.l DO.Q
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 220 07/09/2019 14:44 Blue River 1.5 29.69 "" 1.66 ""
2 220 06/10/2019 14:17 Blue River 2.1 28.67 K 3.62 ""
3 220 05/22/2019 14:31 Blue River 1.7 29.63 "" 7.66 LP
4 220 04/08/2019 14:15 Blue River 1.7 25.9 "" 6.27 ""
5 220 03/13/2019 14:19 Blue River 1.4 23.04 K 5.51 ""
6 220 02/13/2019 14:34 Blue River 2 20.17 "" 6.05 L
7 220 01/15/2019 14:16 Blue River 1.7 18.75 MK 5.45 ""
8 220 12/10/2018 14:53 Blue River 1.1 20.18 "" 5.64 ""
9 220 11/14/2018 14:43 Blue River 1.1 27.43 "" 5.62 ""
I converted to chr
to avoid issues of the data having no common type. You will see that besides the observation data included in Station.ID
, SampleTime
, and AreaName
, the rest of the data includes both the parameter and units in the column name, which I want to extract. Most columns have an associated column for which the name always ends with Q
(i.e. Temp.Mid.Q
), which stands for Qualifier and is the QA/QC codes which match that specific observation and are important to keep with each observation.
This is only a subset, the actual data is much wider with many more parameters.
Ideally, I am looking for an end product like this:
Station.ID SampleTime AreaName Parameter Units Value Qualifier
1 220 07/09/2019 14:44 Blue River Sample.Depth m 1.50
2 220 07/09/2019 14:44 Blue River Temp.Mid degC 29.69
3 220 07/09/2019 14:44 Blue River DO mg.l 1.66
4 220 06/10/2019 14:17 Blue River Sample.Depth m 2.10
5 220 06/10/2019 14:17 Blue River Temp.Mid degC 28.67 K
6 220 06/10/2019 14:17 Blue River DO mg.l 3.62
7 220 05/22/2019 14:31 Blue River Sample.Depth m 1.70
8 220 05/22/2019 14:31 Blue River Temp.Mid degC 29.63
9 220 05/22/2019 14:31 Blue River DO mg.l 7.66 LP
I thought this question on stacking/melting multiple columns would help, but there the column names at least have a standard value
and year
. I am not familiar with regex syntax which may or may not be helpful here.
So, since I need to extract the parameters from my column names, I have resorted to attempting to do this "by hand" using an intermediary spec
data frame as described near the end of the Tidyr Pivoting Vignette.
I began by creating the following spec
:
spec<-tribble(
~.name, ~.value, ~parameter, ~units,
"Sample.Depth.m", "value", "Sample.Depth", "m",
"Temp.Mid.degC", "value", "Temp.Mid", "deg.C",
"DO.mg.l", "value", "DO", "mg.l",
"Temp.Mid.Q", "Qual", NA, NA,
"DO.Q", "Qual", NA, NA,
)
However, perhaps obviously, when I execute:
wq.data.longer<-pivot_longer_spec(wq.data,spec)
I am returned with an extra row with NA
inserted but the actual Qualifier character is absent from Qual
and no longer associated with a specific observation:
wq.data.longer
# A tibble: 36 x 7
Station.ID SampleTime AreaName parameter units value Qual
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 220 07/09/2019 14:44 Blue River Sample.Depth m 1.5 NA
2 220 07/09/2019 14:44 Blue River Temp.Mid deg.C 29.69 NA
3 220 07/09/2019 14:44 Blue River DO mg.l 1.66 NA
4 220 07/09/2019 14:44 Blue River NA NA NA ""
5 220 06/10/2019 14:17 Blue River Sample.Depth m 2.1 NA
6 220 06/10/2019 14:17 Blue River Temp.Mid deg.C 28.67 NA
7 220 06/10/2019 14:17 Blue River DO mg.l 3.62 NA
8 220 06/10/2019 14:17 Blue River NA NA NA ""
9 220 05/22/2019 14:31 Blue River Sample.Depth m 1.7 NA
10 220 05/22/2019 14:31 Blue River Temp.Mid deg.C 29.63 NA
# ... with 26 more rows
Recall my Qualifier would ideally show here in row 6 (for the above view) if all was correct:
Station.ID SampleTime AreaName parameter units value Qual
6 220 06/10/2019 14:17 Blue River Temp.Mid deg.C 28.67 K
This may be a fairly simple solution but I haven't been able to make any progress. Any assistance is very much appreciated. I haven't found evidence of many others using pivot_longer_spec
nor much more guidance from the Tidyr vignette so I'm generally curious how to get farther with this method.
Many thanks!
Upvotes: 4
Views: 696
Reputation: 2987
It seems like you just need to add the names for the parameter
and units
columns in your spec
rather than NA
spec <- tribble(
~.name, ~ .value, ~ parameter, ~ units,
"Sample.Depth.m", "value", "Sample.Depth", "m",
"Temp.Mid.degC", "value", "Temp.Mid", "deg.C",
"DO.mg.l", "value", "DO", "mg.l",
"Temp.Mid.Q", "Qual", "Temp.Mid", "deg.C",
"DO.Q", "Qual", "DO", "mg.1",
)
pivot_longer_spec(df, spec)
# A tibble: 36 x 7
# Station.ID SampleTime AreaName parameter units value Qual
# <int> <fct> <fct> <chr> <chr> <dbl> <chr>
# 1 220 7/9/2019 Blue River Sample.Depth m 1.5 NA
# 2 220 7/9/2019 Blue River Temp.Mid deg.C 29.7 ""
# 3 220 7/9/2019 Blue River DO mg.l 1.66 NA
# 4 220 7/9/2019 Blue River DO mg.1 NA ""
# 5 220 6/10/2019 Blue River Sample.Depth m 2.1 NA
6 220 6/10/2019 Blue River Temp.Mid deg.C 28.7 K
# 7 220 6/10/2019 Blue River DO mg.l 3.62 NA
# 8 220 6/10/2019 Blue River DO mg.1 NA ""
# 9 220 5/22/2019 Blue River Sample.Depth m 1.7 NA
# 10 220 5/22/2019 Blue River Temp.Mid deg.C 29.6 ""
df <- structure(list(Station.ID = c(220L, 220L, 220L, 220L, 220L,
220L, 220L, 220L, 220L), SampleTime = structure(c(9L, 8L, 7L,
6L, 5L, 4L, 1L, 3L, 2L), .Label = c("1/15/2019", "11/14/2018",
"12/10/2018", "2/13/2019", "3/13/2019", "4/8/2019", "5/22/2019",
"6/10/2019", "7/9/2019"), class = "factor"), X = structure(c(8L,
3L, 5L, 1L, 4L, 6L, 2L, 9L, 7L), .Label = c("14:15", "14:16",
"14:17", "14:19", "14:31", "14:34", "14:43", "14:44", "14:53"
), class = "factor"), AreaName = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "Blue River", class = "factor"),
Sample.Depth.m = c(1.5, 2.1, 1.7, 1.7, 1.4, 2, 1.7, 1.1,
1.1), Temp.Mid.degC = c(29.69, 28.67, 29.63, 25.9, 23.04,
20.17, 18.75, 20.18, 27.43), Temp.Mid.Q = structure(c(1L,
2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L), .Label = c("", "K",
"MK"), class = "factor"), DO.mg.l = c(1.66, 3.62, 7.66, 6.27,
5.51, 6.05, 5.45, 5.64, 5.62), DO.Q = c("", "", "LP",
"", "", "L", "", "", "")), row.names = c(NA,
-9L), class = "data.frame")
Upvotes: 1