R - Removing rows from dataframe based on multiple conditions for a single column

Question

I have the following example dataframe in R:

SampleID <- c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "E", "E", "E", "E", "F", "F")
Analyte <- c("A1", "A1", "A2", "A2", "B1", "B2", "C1", "C1", "C1", "C2", "C2", "C2", "D1", "D2", "E1", "E1", "E2", "E2", "F1", "F2")
Fraction <- c("Dissolved", "Total", "Dissolved", "Total", "Total", "Total", "Dissolved", "Suspended", "Total", "Dissolved", "Suspended", "Total", "Unknown", "Unknown", "Dissolved", "Suspended", "Dissolved", "Suspended", "Dissolved", "Dissolved")
Concentration <- c(4.2, 5.6, 8.6, 11.2, 2.1, 9.6, 15.6, 28.7, 42.3, 18.3, 23.2, 48.6, 6.4, 28.8, 9.1, 32.5, 36.4, 24.5, 10.7, 3.4)
MyData <- data.frame(SampleID, Analyte, Fraction, Concentration)

MyData

   SampleID Analyte  Fraction Concentration
1         A      A1 Dissolved           4.2
2         A      A1     Total           5.6
3         A      A2 Dissolved           8.6
4         A      A2     Total          11.2
5         B      B1     Total           2.1
6         B      B2     Total           9.6
7         C      C1 Dissolved          15.6
8         C      C1 Suspended          28.7
9         C      C1     Total          42.3
10        C      C2 Dissolved          18.3
11        C      C2 Suspended          23.2
12        C      C2     Total          48.6
13        D      D1   Unknown           6.4
14        D      D2   Unknown          28.8
15        E      E1 Dissolved           9.1
16        E      E1 Suspended          32.5
17        E      E2 Dissolved          36.4
18        E      E2 Suspended          24.5
19        F      F1 Dissolved          10.7
20        F      F2 Dissolved           3.4

I would like to do the following:

For each SampleID, if an Analyte has a "Total" Fraction reported, retain only that row for the Analyte and remove rows with any other Fraction value (i.e., Dissolved, Suspended) for that Analyte.
If an Analyte for a SampleID includes both Dissolved and Suspended in the Fraction column (and no other values for Fraction), sum the concentrations for Dissolved and Suspended and add a row for that Analyte with the Fraction column labeled Total and the Concentration column listing the sum. Remove the original rows for Dissolved and Suspended for that Analyte.

So for the dataframe above, the two Analytes of SampleID "A" have Dissolved and Total, so I would want to remove the rows with the Dissolved Fraction. For SampleID "C", I would want to remove the Dissolved and Suspended Fractions of both Analytes and just keep the rows with Total. And lastly, for SampleID "E", the Dissolved and Suspended Fractions for each of the two Analytes would be summed together and the result would be a new row for each Analyte that represents the sum (relabeled as Total), and the rows associated with the Dissolved and Suspended Fractions would be removed.

The output of the above dataframe MyData would be the following:

   SampleID Analyte  Fraction Concentration
2         A      A1     Total           5.6
4         A      A2     Total          11.2
5         B      B1     Total           2.1
6         B      B2     Total           9.6
9         C      C1     Total          42.3
12        C      C2     Total          48.6
13        D      D1   Unknown           6.4
14        D      D2   Unknown          28.8
15        E      E1     Total          41.6 
17        E      E2     Total          60.9
19        F      F1 Dissolved          10.7
20        F      F2 Dissolved           3.4

Note that the example I have provided is just a small subset of a much larger dataset that includes hundreds of SampleIDs, but the Fraction column can only equal the values listed in the original dataframe above (i.e., Dissolved, Suspended, Total, or Unknown).

Thank you!

Onyambu · Accepted Answer

This could be done as:

library(tidyverse)
MyData %>%
  pivot_wider(c(SampleID, Analyte),Fraction, values_from = Concentration) %>%
  mutate(Total = coalesce(Total, Dissolved + Suspended), 
         Dissolved = ifelse(is.na(Total)&is.na(Suspended), Dissolved, NA),
         Suspended = ifelse(is.na(Total)&is.na(Dissolved), Suspended, NA)) %>%
  pivot_longer(-c(SampleID, Analyte), values_drop_na = TRUE)

# A tibble: 12 x 4
   SampleID Analyte name      value
               
 1 A        A1      Total       5.6
 2 A        A2      Total      11.2
 3 B        B1      Total       2.1
 4 B        B2      Total       9.6
 5 C        C1      Total      42.3
 6 C        C2      Total      48.6
 7 D        D1      Unknown     6.4
 8 D        D2      Unknown    28.8
 9 E        E1      Total      41.6
10 E        E2      Total      60.9
11 F        F1      Dissolved  10.7
12 F        F2      Dissolved   3.4

R - Removing rows from dataframe based on multiple conditions for a single column

Answers (2)

Related Questions