R Data Wrangling using Gather function

Question

I'm in the long process of combining several datasets into one cleaner datasets. There's site info, parameter info, the result, and a quality control value. My goal is to have the data in the following basic format:

Site_Info  Date     Parameter     Result   QC_value
K1          1/1/12     Aluminum    14.2       NA
K1          1/1/12     TKN         27.8       B3
K8          2/1/18     Iron        0.001      Calc

One of the datasets I'm feeding in is very wide and has a column for every parameter/result and QC value, resulting in 168 columns. The format is similar to this

Site Info  Date      Aluminum  Aluminum_QC    Iron    Iron_QC  TKN    TKN_QC
K1          1/1/12     14.2      NA            NA       NA      27.8  B3
K8          2/1/18     NA        NA            0.001    Calc    NA    NA

So far, I've been able to use the gather function to get a dataset that is close to what I want. All the parameter and result columns are what they should be, but I'm having problems with the related QC columns.

# site info and QC columns to be excluded from gather
ltrm_qual<-c(1:7,10,12,15,17,19,21,23:51,53,55,57,59,61,63,65:76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122:125,127)

LTRM<-gather(LTRM, Parameter, Result, -ltrm_qual)

This results in a dataset similar to this structure. The QC columns are all separated and being treated similar to site info.

Site_Info  Date      Parameter     Result     Aluminum_QC  Iron_QC  TKN_QC
K1          1/1/12     Aluminum     14.2      NA            NA        B3
K1          1/1/12     TKN          27.8      NA            NA        B3
K8          2/1/18     Iron         0.001     NA            Calc      NA

I want to unite the QC columns correctly into one column as identified by the parameters. I know I could work with this result to get one unified QC column by querying the QC column using the parameter. However, I'm sure there's a better method.

Len Greski · Accepted Answer

Here is an answer with the tidyverse...

textFile <- "SiteInfo  Date      Aluminum  Aluminum_QC    Iron    Iron_QC  TKN    TKN_QC
K1          1/1/12     14.2      NA            NA       NA      27.8  B3
K8          2/1/18     NA        NA            0.001    Calc    NA    NA"

data <- read.table(text = textFile,header = TRUE,stringsAsFactors = FALSE)

library(tidyr)
library(dplyr)
narrowMeasures <- data %>% group_by(SiteInfo,Date) %>% 
     select(-contains("_QC")) %>%
     pivot_longer(.,-c(SiteInfo,Date),names_to = "Measure",values_to = "Result")

# pivot the QC data and join with the metals measurements 
data %>% group_by(SiteInfo,Date) %>% 
     select(SiteInfo,Date,contains("_QC")) %>%
     pivot_longer(.,-c(SiteInfo,Date),names_to = "Measure",values_to = "QC_value") %>%
     mutate(Measure = sub("_QC","",Measure)) %>%
     inner_join(.,narrowMeasures)

...and the output:

# A tibble: 6 x 5
# Groups:   SiteInfo, Date [2]
  SiteInfo Date   Measure  QC_value Result
                 
1 K1       1/1/12 Aluminum NA       14.2  
2 K1       1/1/12 Iron     NA       NA    
3 K1       1/1/12 TKN      B3       27.8  
4 K8       2/1/18 Aluminum NA       NA    
5 K8       2/1/18 Iron     Calc      0.001
6 K8       2/1/18 TKN      NA       NA    
>

We can remove the rows where both Result and QC_value are NA by adding filter() at the end of the second pipeline as follows.

 %>% filter(!(is.na(QC_value) & is.na(Result)))

...and the final output:

Joining, by = c("SiteInfo", "Date", "Measure")
# A tibble: 3 x 5
# Groups:   SiteInfo, Date [2]
  SiteInfo Date   Measure  Result QC_value
                 
1 K1       1/1/12 Aluminum 14.2   NA      
2 K1       1/1/12 TKN      27.8   B3      
3 K8       2/1/18 Iron      0.001 Calc    
>

R Data Wrangling using Gather function

Answers (2)

Related Questions