Reputation: 37
I have the following example dataframe in R:
SampleID <- c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "E", "E", "E", "E", "F", "F")
Analyte <- c("A1", "A1", "A2", "A2", "B1", "B2", "C1", "C1", "C1", "C2", "C2", "C2", "D1", "D2", "E1", "E1", "E2", "E2", "F1", "F2")
Fraction <- c("Dissolved", "Total", "Dissolved", "Total", "Total", "Total", "Dissolved", "Suspended", "Total", "Dissolved", "Suspended", "Total", "Unknown", "Unknown", "Dissolved", "Suspended", "Dissolved", "Suspended", "Dissolved", "Dissolved")
Concentration <- c(4.2, 5.6, 8.6, 11.2, 2.1, 9.6, 15.6, 28.7, 42.3, 18.3, 23.2, 48.6, 6.4, 28.8, 9.1, 32.5, 36.4, 24.5, 10.7, 3.4)
MyData <- data.frame(SampleID, Analyte, Fraction, Concentration)
MyData
SampleID Analyte Fraction Concentration
1 A A1 Dissolved 4.2
2 A A1 Total 5.6
3 A A2 Dissolved 8.6
4 A A2 Total 11.2
5 B B1 Total 2.1
6 B B2 Total 9.6
7 C C1 Dissolved 15.6
8 C C1 Suspended 28.7
9 C C1 Total 42.3
10 C C2 Dissolved 18.3
11 C C2 Suspended 23.2
12 C C2 Total 48.6
13 D D1 Unknown 6.4
14 D D2 Unknown 28.8
15 E E1 Dissolved 9.1
16 E E1 Suspended 32.5
17 E E2 Dissolved 36.4
18 E E2 Suspended 24.5
19 F F1 Dissolved 10.7
20 F F2 Dissolved 3.4
I would like to do the following:
For each SampleID
, if an Analyte
has a "Total" Fraction
reported, retain only that row for the Analyte
and remove rows with any other Fraction
value (i.e., Dissolved, Suspended) for that Analyte
.
If an Analyte
for a SampleID
includes both Dissolved and Suspended in the Fraction
column (and no other values for Fraction
), sum the concentrations for Dissolved and Suspended and add a row for that Analyte
with the Fraction
column labeled Total and the Concentration
column listing the sum. Remove the original rows for Dissolved and Suspended for that Analyte
.
So for the dataframe above, the two Analytes
of SampleID
"A" have Dissolved and Total, so I would want to remove the rows with the Dissolved Fraction
. For SampleID
"C", I would want to remove the Dissolved and Suspended Fractions
of both Analytes
and just keep the rows with Total. And lastly, for SampleID
"E", the Dissolved and Suspended Fractions
for each of the two Analytes
would be summed together and the result would be a new row for each Analyte
that represents the sum (relabeled as Total), and the rows associated with the Dissolved and Suspended Fractions
would be removed.
The output of the above dataframe MyData
would be the following:
SampleID Analyte Fraction Concentration
2 A A1 Total 5.6
4 A A2 Total 11.2
5 B B1 Total 2.1
6 B B2 Total 9.6
9 C C1 Total 42.3
12 C C2 Total 48.6
13 D D1 Unknown 6.4
14 D D2 Unknown 28.8
15 E E1 Total 41.6
17 E E2 Total 60.9
19 F F1 Dissolved 10.7
20 F F2 Dissolved 3.4
Note that the example I have provided is just a small subset of a much larger dataset that includes hundreds of SampleIDs
, but the Fraction
column can only equal the values listed in the original dataframe above (i.e., Dissolved, Suspended, Total, or Unknown).
Thank you!
Upvotes: 1
Views: 66
Reputation: 21938
You can also use the following solution. It may sound a bit verbose but will also get the job done:
library(dplyr)
library(purrr)
MyData %>%
group_split(SampleID, Analyte) %>%
map(~ if("Total" %in% .x$Fraction) {
.x %>% filter(Fraction == "Total")} else {
.x
}) %>%
map(~ if(all(c("Dissolved", "Suspended") %in% .x$Fraction)) {
add_row(.x, SampleID = .x$SampleID[1], Analyte = .x$Analyte[1],
Fraction = "Total", Concentration = sum(.x$Concentration))
} else {
.x
}) %>%
map_dfr(~ if("Total" %in% .x$Fraction) {
.x %>% filter(Fraction == "Total")} else {
.x
})
# A tibble: 12 x 4
SampleID Analyte Fraction Concentration
<chr> <chr> <chr> <dbl>
1 A A1 Total 5.6
2 A A2 Total 11.2
3 B B1 Total 2.1
4 B B2 Total 9.6
5 C C1 Total 42.3
6 C C2 Total 48.6
7 D D1 Unknown 6.4
8 D D2 Unknown 28.8
9 E E1 Total 41.6
10 E E2 Total 60.9
11 F F1 Dissolved 10.7
12 F F2 Dissolved 3.4
Upvotes: 1
Reputation: 79338
This could be done as:
library(tidyverse)
MyData %>%
pivot_wider(c(SampleID, Analyte),Fraction, values_from = Concentration) %>%
mutate(Total = coalesce(Total, Dissolved + Suspended),
Dissolved = ifelse(is.na(Total)&is.na(Suspended), Dissolved, NA),
Suspended = ifelse(is.na(Total)&is.na(Dissolved), Suspended, NA)) %>%
pivot_longer(-c(SampleID, Analyte), values_drop_na = TRUE)
# A tibble: 12 x 4
SampleID Analyte name value
<chr> <chr> <chr> <dbl>
1 A A1 Total 5.6
2 A A2 Total 11.2
3 B B1 Total 2.1
4 B B2 Total 9.6
5 C C1 Total 42.3
6 C C2 Total 48.6
7 D D1 Unknown 6.4
8 D D2 Unknown 28.8
9 E E1 Total 41.6
10 E E2 Total 60.9
11 F F1 Dissolved 10.7
12 F F2 Dissolved 3.4
Upvotes: 0