conditional if/else statements across two data frames in R

Question

I have a data.frame of a few columns of known/expected results and want to compare these to a second data.frame with the same structure but potentially different results. There are four possible scenarios:

TruePositive (TP) == elements of the known/observed vectors contain the same information and neither is NA
TrueNegative (TN) == elements of the known/observed vectors are both NA
FalsePositive (FP) == known is NA, while observed is some nonNA value
FalseNegative (FN) == known is some nonNA value, observed is NA

I'd like to produce a new data.frame that evaluates those four conditions for each of the columns in the known/observed data.frames.

I have come across posts that describe multiple conditions within a vector, but I haven't seen any information on comparing the elements of two vectors that share the same index (in R).

Let's say for simplicity you have a pair of maps: the "True" map is what we expect, and the "observed" map is what we have recorded. For both maps, these observations occur at identical locations, and each location is broken up into three parts (Country, Region, and Zone).

true_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 3), "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", NA)
)

obsrvd_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 2), NA, "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", "Fairbanks")
)

At the Country level, the true and observed maps are identical. At the Region level however, mapsection3 is missing information in the "observed" map (NA), while the "true" map has a location (Ontario). This would be classified as a FalseNegative, because we are missing information where we expect it to be. Conversely, mapsection5 is missing information in the "true" map at the Zone level, while the "observed" map contains information (Fairbanks). This would be classified as a FalsePositive because there is additional information where we do not expect information. However, both the "true" and "observed" maps are lacking information in the Zone level for mapsection3. This would be considered a TrueNegative because both are missing information.

The final result would summarize these observations as follows:

MapSection Country Region Zone
mapsection1      TP     TP   TP
mapsection2      TP     TP   TP
mapsection3      TP     FN   TN
mapsection4      TP     TP   TP
mapsection5      TP     TP   FP

Thank you for your assistance!

camille · Accepted Answer

The logic in your answer looks solid; it just doesn't yet scale to the other combinations you need. To do that, I'd reshape both data frames into a long form, so each has one column of geographic levels (Country, Region, Zone) and one column holding the value at that level.

library(dplyr)
library(tidyr)

true_map_long <- true_map %>%
  gather(key = level, value = value, -MapSection)
obsrvd_map_long <- obsrvd_map %>%
  gather(key = level, value = value, -MapSection)

Both are shaped like:

head(true_map_long)
#>    MapSection   level        value
#> 1 mapsection1 Country       Canada
#> 2 mapsection2 Country       Canada
#> 3 mapsection3 Country       Canada
#> 4 mapsection4 Country UnitedStates
#> 5 mapsection5 Country UnitedStates
#> 6 mapsection1  Region      Ontario

Join these two long-shaped tables by map section and level, and give appropriate suffixes to make it clearer which is which. The case_when is essentially the same, but now you're not tied to one location.

joined <- inner_join(
  true_map_long,
  obsrvd_map_long,
  by = c("MapSection", "level"),
  suffix = c("_t", "_o")
) %>%
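  mutate(truth = case_when(
    # `==` returns NA when either side is NA, so those rows fall through
    value_t == value_o  ~ "TP",
    # require both to be NA; is.na(value_t) == is.na(value_o) would also
    # match two differing non-NA values and label them "TN"
    is.na(value_t) & is.na(value_o)  ~ "TN",
    is.na(value_t) & !is.na(value_o)  ~ "FP",
    !is.na(value_t) & is.na(value_o)  ~ "FN"
    # two differing non-NA values match nothing here and are left as NA
  ))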
head(joined)
#>    MapSection   level      value_t      value_o truth
#> 1 mapsection1 Country       Canada       Canada    TP
#> 2 mapsection2 Country       Canada       Canada    TP
#> 3 mapsection3 Country       Canada       Canada    TP
#> 4 mapsection4 Country UnitedStates UnitedStates    TP
#> 5 mapsection5 Country UnitedStates UnitedStates    TP
#> 6 mapsection1  Region      Ontario      Ontario    TP
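
To see why the order of the conditions matters, here is a minimal sketch on two plain vectors rather than the joined table. The vectors `known` and `observed` and the "mismatch" label are made up for illustration; your four scenarios don't say what should happen when both values are present but differ, so treat that last branch as an assumption.

```r
library(dplyr)

# Hypothetical example vectors, compared element by element (same index)
known    <- c("Toronto", NA, NA,          "Albany", "Ottawa")
observed <- c("Toronto", NA, "Fairbanks", NA,       "Toronto")

# `==` gives NA whenever either side is NA; case_when() treats an NA
# condition as "no match" and moves on to the next condition
known == observed

labels <- case_when(
  known == observed               ~ "TP",
  is.na(known) & is.na(observed)  ~ "TN",
  is.na(known) & !is.na(observed) ~ "FP",
  !is.na(known) & is.na(observed) ~ "FN",
  TRUE                            ~ "mismatch"  # assumed label, not in the question
)
labels
```

The final `TRUE ~ "mismatch"` branch is optional. Without it, a pair of differing non-NA values comes back as NA, which may be enough to flag those rows for inspection.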

Then drop the value columns and spread to a wide shape again. You could do this and the joining in one step; breaking into two parts was just easier for explaining.

joined %>%
  select(-starts_with("value")) %>%
  spread(key = level, value = truth)
#>    MapSection Country Region Zone
#> 1 mapsection1      TP     TP   TP
#> 2 mapsection2      TP     TP   TP
#> 3 mapsection3      TP     FN   TN
#> 4 mapsection4      TP     TP   TP
#> 5 mapsection5      TP     TP   FP

Created on 2019-05-31 by the reprex package (v0.3.0)
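
As mentioned, the reshaping, joining and widening can be done in one pipeline. Below is a rough sketch of that, assuming tidyr 1.0.0 or later, where `pivot_longer()` and `pivot_wider()` are the newer counterparts of `gather()` and `spread()`. The helper `to_long()` is a name invented here for illustration, and `stringsAsFactors = FALSE` is only there so the columns are character vectors on R versions before 4.0.

```r
library(dplyr)
library(tidyr)

true_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 3), "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", NA),
  stringsAsFactors = FALSE
)

obsrvd_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 2), NA, "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", "Fairbanks"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per MapSection/level pair
to_long <- function(df) {
  pivot_longer(df, cols = -MapSection, names_to = "level", values_to = "value")
}

result <- inner_join(
  to_long(true_map),
  to_long(obsrvd_map),
  by = c("MapSection", "level"),
  suffix = c("_t", "_o")
) %>%
  mutate(truth = case_when(
    value_t == value_o               ~ "TP",
    is.na(value_t) & is.na(value_o)  ~ "TN",
    is.na(value_t) & !is.na(value_o) ~ "FP",
    !is.na(value_t) & is.na(value_o) ~ "FN"
  )) %>%
  # drop value_t / value_o, then go back to one column per level
  select(-starts_with("value")) %>%
  pivot_wider(names_from = level, values_from = truth)

result
```

This should give the same TP/TN/FP/FN table as the two-step version, only returned as a tibble instead of a data.frame.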
