Reputation: 49
I'm in the long process of combining several datasets into one cleaner datasets. There's site info, parameter info, the result, and a quality control value. My goal is to have the data in the following basic format:
Site_Info Date Parameter Result QC_value
K1 1/1/12 Aluminum 14.2 NA
K1 1/1/12 TKN 27.8 B3
K8 2/1/18 Iron 0.001 Calc
One of the datasets I'm feeding in is very wide and has a column for every parameter/result and QC value, resulting in 168 columns. The format is similar to this
Site Info Date Aluminum Aluminum_QC Iron Iron_QC TKN TKN_QC
K1 1/1/12 14.2 NA NA NA 27.8 B3
K8 2/1/18 NA NA 0.001 Calc NA NA
So far, I've been able to use the gather function to get a dataset that is close to what I want. All the parameter and result columns are what they should be, but I'm having problems with the related QC columns.
# site info and QC columns to be excluded from gather
ltrm_qual<-c(1:7,10,12,15,17,19,21,23:51,53,55,57,59,61,63,65:76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122:125,127)
LTRM<-gather(LTRM, Parameter, Result, -ltrm_qual)
This results in a dataset similar to this structure. The QC columns are all separated and being treated similar to site info.
Site_Info Date Parameter Result Aluminum_QC Iron_QC TKN_QC
K1 1/1/12 Aluminum 14.2 NA NA B3
K1 1/1/12 TKN 27.8 NA NA B3
K8 2/1/18 Iron 0.001 NA Calc NA
I want to unite the QC columns correctly into one column as identified by the parameters. I know I could work with this result to get one unified QC column by querying the QC column using the parameter. However, I'm sure there's a better method.
Upvotes: 2
Views: 60
Reputation: 10855
Here is an answer with the tidyverse...
textFile <- "SiteInfo Date Aluminum Aluminum_QC Iron Iron_QC TKN TKN_QC
K1 1/1/12 14.2 NA NA NA 27.8 B3
K8 2/1/18 NA NA 0.001 Calc NA NA"
data <- read.table(text = textFile,header = TRUE,stringsAsFactors = FALSE)
library(tidyr)
library(dplyr)
narrowMeasures <- data %>% group_by(SiteInfo,Date) %>%
select(-contains("_QC")) %>%
pivot_longer(.,-c(SiteInfo,Date),names_to = "Measure",values_to = "Result")
# pivot the QC data and join with the metals measurements
data %>% group_by(SiteInfo,Date) %>%
select(SiteInfo,Date,contains("_QC")) %>%
pivot_longer(.,-c(SiteInfo,Date),names_to = "Measure",values_to = "QC_value") %>%
mutate(Measure = sub("_QC","",Measure)) %>%
inner_join(.,narrowMeasures)
...and the output:
# A tibble: 6 x 5
# Groups: SiteInfo, Date [2]
SiteInfo Date Measure QC_value Result
<chr> <chr> <chr> <chr> <dbl>
1 K1 1/1/12 Aluminum NA 14.2
2 K1 1/1/12 Iron NA NA
3 K1 1/1/12 TKN B3 27.8
4 K8 2/1/18 Aluminum NA NA
5 K8 2/1/18 Iron Calc 0.001
6 K8 2/1/18 TKN NA NA
>
We can remove the rows where both Result and QC_value are NA by adding filter()
at the end of the second pipeline as follows.
%>% filter(!(is.na(QC_value) & is.na(Result)))
...and the final output:
Joining, by = c("SiteInfo", "Date", "Measure")
# A tibble: 3 x 5
# Groups: SiteInfo, Date [2]
SiteInfo Date Measure Result QC_value
<chr> <chr> <chr> <dbl> <chr>
1 K1 1/1/12 Aluminum 14.2 NA
2 K1 1/1/12 TKN 27.8 B3
3 K8 2/1/18 Iron 0.001 Calc
>
Upvotes: 2
Reputation: 3876
Like this?:
df %>%
pivot_longer(cols = c(5:7), names_to = c("QC"), values_drop_na = TRUE)
# A tibble: 3 x 6
Site_Info Date Parameter Result QC value
<chr> <chr> <chr> <dbl> <chr> <chr>
1 K1 01.01.2012 Aluminium 14.2 TKN_QC B3
2 K1 01.01.2012 TKN 27.8 TKN_QC B3
3 K8 02.01.2018 Iron 0.001 Iron_QC Calc
Upvotes: 1