Devon O'Rourke

Reputation: 267

conditional if/else statements across two data frames in R

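I have a data.frame with a few columns of known/expected results and want to compare it to a second data.frame with the same structure but potentially different results. There are four possible scenarios: TruePositive (TP), where both contain the same information; TrueNegative (TN), where both are missing information; FalsePositive (FP), where the observed data contains information the expected data lacks; and FalseNegative (FN), where the observed data is missing information the expected data contains.

I'd like to produce a new data.frame that evaluates those four conditions for each column shared by the known and observed data.frames.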

I have come across posts that describe multiple conditions within a vector, but I haven't seen any information on comparing the elements of two vectors that share the same index (in R).

Let's say for simplicity you have a pair of maps: the "True" map is what we expect, and the "observed" map is what we have recorded. For both maps, these observations occur at identical locations, and each location is broken up into three parts (Country, Region, and Zone).

true_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 3), "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", NA)
)

obsrvd_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 2), NA, "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", "Fairbanks")
)

At the Country level, the true and observed maps are identical. At the Region level, however, mapsection3 is missing information in the "observed" map (NA) while the "true" map has a location (Ontario). This would be classified as a FalseNegative, because information is missing where we expect it. Conversely, mapsection5 is missing information in the "true" map at the Zone level, while the "observed" map contains information (Fairbanks). This would be classified as a FalsePositive, because there is information where we expect none. Finally, both the "true" and "observed" maps lack information at the Zone level for mapsection3. This would be considered a TrueNegative, because both are missing information.

The final result would summarize these observations as follows:

 MapSection Country Region Zone
mapsection1      TP     TP   TP
mapsection2      TP     TP   TP
mapsection3      TP     FN   TN
mapsection4      TP     TP   TP
mapsection5      TP     TP   FP

Thank you for your assistance!

Upvotes: 1

Views: 751

Answers (2)

camille

Reputation: 16871

The logic in your answer looks solid; it just doesn't yet scale to the other levels you need. To do that, I'd reshape the data into a long form so you have one column of geographic levels and one of their values.

library(dplyr)
library(tidyr)

true_map_long <- true_map %>%
  gather(key = level, value = value, -MapSection)
obsrvd_map_long <- obsrvd_map %>%
  gather(key = level, value = value, -MapSection)

Both are shaped like:

head(true_map_long)
#>    MapSection   level        value
#> 1 mapsection1 Country       Canada
#> 2 mapsection2 Country       Canada
#> 3 mapsection3 Country       Canada
#> 4 mapsection4 Country UnitedStates
#> 5 mapsection5 Country UnitedStates
#> 6 mapsection1  Region      Ontario

Join these two long tables by map section and level, with suffixes that make it clear which value is which. The case_when is essentially the same, but now you're not tied to a single location type.

joined <- inner_join(
  true_map_long,
  obsrvd_map_long,
  by = c("MapSection", "level"),
  suffix = c("_t", "_o")
) %>%
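  mutate(truth = case_when(
    value_t == value_o               ~ "TP",  # both present and equal
    is.na(value_t) & is.na(value_o)  ~ "TN",  # both missing
    is.na(value_t) & !is.na(value_o) ~ "FP",  # only observed present
    !is.na(value_t) & is.na(value_o) ~ "FN"   # only expected present
  ))  # present-but-different pairs match no case and stay NA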
head(joined)
#>    MapSection   level      value_t      value_o truth
#> 1 mapsection1 Country       Canada       Canada    TP
#> 2 mapsection2 Country       Canada       Canada    TP
#> 3 mapsection3 Country       Canada       Canada    TP
#> 4 mapsection4 Country UnitedStates UnitedStates    TP
#> 5 mapsection5 Country UnitedStates UnitedStates    TP
#> 6 mapsection1  Region      Ontario      Ontario    TP

Then drop the value columns and spread to a wide shape again. You could do this and the joining in one step; breaking into two parts was just easier for explaining.

joined %>%
  select(-starts_with("value")) %>%
  spread(key = level, value = truth)
#>    MapSection Country Region Zone
#> 1 mapsection1      TP     TP   TP
#> 2 mapsection2      TP     TP   TP
#> 3 mapsection3      TP     FN   TN
#> 4 mapsection4      TP     TP   TP
#> 5 mapsection5      TP     TP   FP

Created on 2019-05-31 by the reprex package (v0.3.0)
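
The one-step version mentioned above could look roughly like the following. This is a sketch rather than part of the original reprex: it swaps gather()/spread() for pivot_longer()/pivot_wider() (available from tidyr 1.0.0), uses a hypothetical helper named to_long(), builds the data with stringsAsFactors = FALSE, and writes the TN test as "both missing" so that values which are present but different are not labelled TN.

```r
library(dplyr)
library(tidyr)

true_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 3), "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", NA),
  stringsAsFactors = FALSE
)

obsrvd_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 2), NA, "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", "Fairbanks"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per (MapSection, level) pair
to_long <- function(df) {
  pivot_longer(df, cols = -MapSection, names_to = "level", values_to = "value")
}

result <- inner_join(
  to_long(true_map),
  to_long(obsrvd_map),
  by = c("MapSection", "level"),
  suffix = c("_t", "_o")
) %>%
  mutate(truth = case_when(
    value_t == value_o               ~ "TP",  # both present and equal
    is.na(value_t) & is.na(value_o)  ~ "TN",  # both missing
    is.na(value_t) & !is.na(value_o) ~ "FP",  # only observed present
    !is.na(value_t) & is.na(value_o) ~ "FN"   # only expected present
  )) %>%
  select(-starts_with("value")) %>%
  pivot_wider(names_from = level, values_from = truth)

result
```

Pairs where both values are present but differ match none of the four cases and are left as NA, which makes them easy to spot if they ever occur.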

Upvotes: 1

Devon O'Rourke

Reputation: 267

Following Bruce's advice, I pulled out just a single location type and tried evaluating the four conditions I described. I think it's working...

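library(dplyr)

# Pair the expected and observed Zone values by map section
Zone_df <- merge(true_map[, c("Zone", "MapSection")], obsrvd_map[, c("Zone", "MapSection")], by = "MapSection") %>%
  rename(exp.Zone = Zone.x, obs.Zone = Zone.y) %>%
  mutate_if(is.factor, as.character)

Zone_df %>%
  mutate(EvalCase = case_when(
    exp.Zone == obs.Zone               ~ "TP",  # both present and equal
    is.na(exp.Zone) & is.na(obs.Zone)  ~ "TN",  # both missing
    is.na(exp.Zone) & !is.na(obs.Zone) ~ "FP",  # only observed present
    !is.na(exp.Zone) & is.na(obs.Zone) ~ "FN"   # only expected present
  ))  # present-but-different pairs match no case and stay NA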
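
One way the single-column check above might extend to every column without reshaping is a small helper applied column by column. This is a base R sketch, not the approach from either answer: classify_pair() is a hypothetical name, the code assumes both data frames list the same map sections in the same row order, and it leaves pairs that are both present but different as NA because the four cases do not cover them.

```r
true_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 3), "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", NA),
  stringsAsFactors = FALSE
)

obsrvd_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 2), NA, "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", "Fairbanks"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify expected/observed values element by element
classify_pair <- function(expected, observed) {
  expected <- as.character(expected)  # guard against factor columns
  observed <- as.character(observed)
  out <- rep(NA_character_, length(expected))
  out[!is.na(expected) & !is.na(observed) & expected == observed] <- "TP"
  out[is.na(expected) & is.na(observed)] <- "TN"
  out[is.na(expected) & !is.na(observed)] <- "FP"
  out[!is.na(expected) & is.na(observed)] <- "FN"
  out  # still NA where both are present but differ
}

# Assumption: rows of the two data frames line up one to one
stopifnot(identical(as.character(true_map$MapSection),
                    as.character(obsrvd_map$MapSection)))

level_cols <- setdiff(names(true_map), "MapSection")
result <- cbind(
  true_map["MapSection"],
  as.data.frame(
    Map(classify_pair, true_map[level_cols], obsrvd_map[level_cols]),
    stringsAsFactors = FALSE
  )
)

result
```

If the row order cannot be guaranteed, merging the two data frames by MapSection first, as in the Zone example, would be the safer route.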

Upvotes: 0
