TheSciGuy
TheSciGuy

Reputation: 1196

Create new column based on if rows from 2 data frames match

This seems simple enough, but can't figure it out. I'd like to create a new column in df2 (impute_id) that identifies whether or not the value (measurement) was imputed or if it is the raw, observed value from df1. If the rows match, then in the new column in df2, impute_id, assign the string observed and if the rows do not match, then assign the string imputed. I'd like to do this using dplyr if possible. Also to note, the rows in the data frames may not be in the same order even though they are in the example.


Example

Raw data

df1
   time protocol     measurement_type sample measurement
1     0     HPLC cis,cis-Muconic acid      a     0.57561
2     0     HPLC            D-Glucose      a          NA
3     0     HPLC cis,cis-Muconic acid      a          NA
4     0     HPLC            D-Glucose      b          NA
5     0    OD600      Optical Density      b     0.14430
6    22     HPLC cis,cis-Muconic acid      b          NA
7    22     HPLC            D-Glucose      a          NA
8    22    OD600      Optical Density      a          NA
9    24     HPLC cis,cis-Muconic acid      a          NA
10   24     HPLC            D-Glucose      b    33.95529

Imputed Data

df2
   time protocol     measurement_type sample measurement
1     0     HPLC cis,cis-Muconic acid      a     0.57561
2     0     HPLC            D-Glucose      a    33.95529
3     0     HPLC cis,cis-Muconic acid      a     0.57561
4     0     HPLC            D-Glucose      b    33.95529
5     0    OD600      Optical Density      b     0.14430
6    22     HPLC cis,cis-Muconic acid      b     0.57561
7    22     HPLC            D-Glucose      a    33.95529
8    22    OD600      Optical Density      a     0.14430
9    24     HPLC cis,cis-Muconic acid      a     0.57561
10   24     HPLC            D-Glucose      b    33.95529

Desired Output

df2
   time protocol     measurement_type sample measurement  impute_id
1     0     HPLC cis,cis-Muconic acid      a     0.57561   observed
2     0     HPLC            D-Glucose      a    33.95529    imputed
3     0     HPLC cis,cis-Muconic acid      a     0.57561    imputed
4     0     HPLC            D-Glucose      b    33.95529    imputed
5     0    OD600      Optical Density      b     0.14430   observed
6    22     HPLC cis,cis-Muconic acid      b     0.57561    imputed
7    22     HPLC            D-Glucose      a    33.95529    imputed
8    22    OD600      Optical Density      a     0.14430    imputed
9    24     HPLC cis,cis-Muconic acid      a     0.57561    imputed
10   24     HPLC            D-Glucose      b    33.95529   observed

Reproducible Data

Raw Data

df1 <- structure(list(time = c(0L, 0L, 0L, 0L, 0L, 22L, 22L, 22L, 24L, 
24L), protocol = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L), .Label = c("HPLC", "OD600"), class = "factor"), measurement_type = structure(c(1L, 
2L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), .Label = c("cis,cis-Muconic acid", 
"D-Glucose", "Optical Density"), class = "factor"), sample = c("a", 
"a", "a", "b", "b", "b", "a", "a", "a", "b"), measurement = c(0.57561, 
NA, NA, NA, 0.1443, NA, NA, NA, NA, 33.95529)), row.names = c(NA, 
-10L), class = "data.frame")

Imputed Data

df2 <- structure(list(time = c(0L, 0L, 0L, 0L, 0L, 22L, 22L, 22L, 24L, 
24L), protocol = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L), .Label = c("HPLC", "OD600"), class = "factor"), measurement_type = structure(c(1L, 
2L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), .Label = c("cis,cis-Muconic acid", 
"D-Glucose", "Optical Density"), class = "factor"), sample = c("a", 
"a", "a", "b", "b", "b", "a", "a", "a", "b"), measurement = c(0.57561, 
33.95529, 0.57561, 33.95529, 0.1443, 0.57561, 33.95529, 0.1443, 
0.57561, 33.95529)), row.names = c(NA, -10L), class = "data.frame")

Upvotes: 2

Views: 56

Answers (1)

Matt
Matt

Reputation: 2987

Maybe something like

library(dplyr)

df1 %>%
  group_by(measurement_type) %>%
  mutate(impute_id = ifelse(is.na(measurement), "imputed", "observed"),
         measurement = min(measurement, na.rm = TRUE))

   time protocol     measurement_type sample measurement  impute_id
1     0     HPLC cis,cis-Muconic acid      a     0.57561 observed
2     0     HPLC            D-Glucose      a    33.95529  imputed
3     0     HPLC cis,cis-Muconic acid      a     0.57561  imputed
4     0     HPLC            D-Glucose      b    33.95529  imputed
5     0    OD600      Optical Density      b     0.14430 observed
6    22     HPLC cis,cis-Muconic acid      b     0.57561  imputed
7    22     HPLC            D-Glucose      a    33.95529  imputed
8    22    OD600      Optical Density      a     0.14430  imputed
9    24     HPLC cis,cis-Muconic acid      a     0.57561  imputed
10   24     HPLC            D-Glucose      b    33.95529 observed

Upvotes: 1

Related Questions