Reputation: 267
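I have a data.frame of a few columns of known/expected results and want to compare these to a second data.frame with the same structure but potentially different results. There are four possible scenarios: a value is present in both and matches (true positive, TP); a value is missing (NA) in both (true negative, TN); a value is present in the observed data but missing in the expected data (false positive, FP); and a value is present in the expected data but missing in the observed data (false negative, FN).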
I'd like to produce a new data.frame that evaluates those four conditions for each of the columns in the known/observed data.frames.
I have come across posts that describe multiple conditions within a vector, but I haven't seen any information on comparing the elements of two vectors that share the same index (in R).
Let's say for simplicity you have a pair of maps: the "True" map is what we expect, and the "observed" map is what we have recorded. For both maps, these observations occur at identical locations, and each location is broken up into three parts (Country, Region, and Zone).
true_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 3), "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", NA)
)
obsrvd_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 2), NA, "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", "Fairbanks")
)
At the Country level, the true and observed maps are identical. At the Region level, however, mapsection3 is missing information in the "observed" map (NA), while the "true" map has a location (Ontario). This would be classified as a FalseNegative, because we are missing information where we expect it to be.
Conversely, mapsection5 is missing information in the "true" map at the Zone level, while the "observed" map contains information (Fairbanks). This would be classified as a FalsePositive because there is additional information where we do not expect information. However, both the "true" and "observed" maps are lacking information at the Zone level for mapsection3. This would be considered a TrueNegative because both are missing information.
The final result would summarize these observations as follows:
MapSection Country Region Zone
mapsection1 TP TP TP
mapsection2 TP TP TP
mapsection3 TP FN TN
mapsection4 TP TP TP
mapsection5 TP TP FP
Thank you for your assistance!
Upvotes: 1
Views: 751
Reputation: 16871
The logic in your answer looks solid; it just doesn't yet scale to the other combinations you need. To do that, I'd reshape the data into a long form so you have one column of geographic levels and one of values.
library(dplyr)
library(tidyr)
true_map_long <- true_map %>%
  gather(key = level, value = value, -MapSection)
obsrvd_map_long <- obsrvd_map %>%
  gather(key = level, value = value, -MapSection)
Both are shaped like:
head(true_map_long)
#> MapSection level value
#> 1 mapsection1 Country Canada
#> 2 mapsection2 Country Canada
#> 3 mapsection3 Country Canada
#> 4 mapsection4 Country UnitedStates
#> 5 mapsection5 Country UnitedStates
#> 6 mapsection1 Region Ontario
Join these two long-shaped tables by map section and level, and give appropriate suffixes to make it clearer which is which. The case_when is essentially the same, but now you're not tied to one location.
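joined <- inner_join(
  true_map_long,
  obsrvd_map_long,
  by = c("MapSection", "level"),
  suffix = c("_t", "_o")
) %>%
  mutate(truth = case_when(
    # equal, non-missing values; the comparison is NA (no match) if either side is NA
    value_t == value_o ~ "TP",
    # both missing; `&` rather than `==`, so two different non-NA values are not labelled TN
    is.na(value_t) & is.na(value_o) ~ "TN",
    is.na(value_t) & !is.na(value_o) ~ "FP",
    !is.na(value_t) & is.na(value_o) ~ "FN"
  ))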
head(joined)
#> MapSection level value_t value_o truth
#> 1 mapsection1 Country Canada Canada TP
#> 2 mapsection2 Country Canada Canada TP
#> 3 mapsection3 Country Canada Canada TP
#> 4 mapsection4 Country UnitedStates UnitedStates TP
#> 5 mapsection5 Country UnitedStates UnitedStates TP
#> 6 mapsection1 Region Ontario Ontario TP
Then drop the value columns and spread to a wide shape again. You could do this and the joining in one step; breaking into two parts was just easier for explaining.
joined %>%
  select(-starts_with("value")) %>%
  spread(key = level, value = truth)
#> MapSection Country Region Zone
#> 1 mapsection1 TP TP TP
#> 2 mapsection2 TP TP TP
#> 3 mapsection3 TP FN TN
#> 4 mapsection4 TP TP TP
#> 5 mapsection5 TP TP FP
Created on 2019-05-31 by the reprex package (v0.3.0)
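For readers on a newer tidyr, the same reshape, join, and reshape-back steps can be sketched with pivot_longer() and pivot_wider(), which supersede gather() and spread(). This is only a sketch under a few assumptions: the to_long() helper and the "Mismatch" label are made up here for illustration (the question does not say how two different non-missing values should be classified), and stringsAsFactors = FALSE is set so the comparison works on character columns in any R version.

```r
library(dplyr)
library(tidyr)

true_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 3), "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", NA),
  stringsAsFactors = FALSE
)
obsrvd_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 2), NA, "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", "Fairbanks"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per MapSection/level pair
to_long <- function(df) {
  pivot_longer(df, cols = -MapSection, names_to = "level", values_to = "value")
}

result <- inner_join(
  to_long(true_map),
  to_long(obsrvd_map),
  by = c("MapSection", "level"),
  suffix = c("_t", "_o")
) %>%
  mutate(truth = case_when(
    is.na(value_t) & is.na(value_o) ~ "TN",
    is.na(value_t) ~ "FP",        # expected missing, observed present
    is.na(value_o) ~ "FN",        # expected present, observed missing
    value_t == value_o ~ "TP",    # both present and equal
    TRUE ~ "Mismatch"             # assumption: both present but different
  )) %>%
  select(MapSection, level, truth) %>%
  pivot_wider(names_from = level, values_from = truth)

print(result)
```

Checking the NA cases first means the equality test only ever sees two non-missing values, so the order of the remaining conditions no longer matters.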
Upvotes: 1
Reputation: 267
Following Bruce's advice, I pulled out just a single location type and tried evaluating the four conditions I described. I think it's working...
library(dplyr)
Zone_df <- merge(true_map[, c("Zone", "MapSection")], obsrvd_map[, c("Zone", "MapSection")], by = "MapSection") %>%
  rename(., exp.Zone = Zone.x, obs.Zone = Zone.y) %>%
  mutate_if(is.factor, as.character)
Zone_df %>%
  mutate(EvalCase = case_when(
    exp.Zone == obs.Zone ~ "TP",
    # both missing; `&` rather than `==`, so two different non-NA values are not labelled TN
    is.na(exp.Zone) & is.na(obs.Zone) ~ "TN",
    is.na(exp.Zone) & !is.na(obs.Zone) ~ "FP",
    !is.na(exp.Zone) & is.na(obs.Zone) ~ "FN"
  ))
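To extend this from Zone to every column without reshaping, one option is to wrap the four conditions in a small function and map it over matching columns in base R. The sketch below assumes that both data.frames have their rows in the same MapSection order (there is no merge step), and classify_cells() is a name made up for illustration. Pairs where both values are present but differ are left as NA, since the question does not define that case.

```r
true_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 3), "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", NA),
  stringsAsFactors = FALSE
)
obsrvd_map <- data.frame(
  MapSection = paste0("mapsection", 1:5),
  Country = c(rep("Canada", 3), rep("UnitedStates", 2)),
  Region = c(rep("Ontario", 2), NA, "NewYork", "Alaska"),
  Zone = c("Toronto", "Ottawa", NA, "Albany", "Fairbanks"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify two vectors element by element
classify_cells <- function(expected, observed) {
  expected <- as.character(expected)
  observed <- as.character(observed)
  out <- rep(NA_character_, length(expected))
  out[is.na(expected) & is.na(observed)] <- "TN"
  out[is.na(expected) & !is.na(observed)] <- "FP"
  out[!is.na(expected) & is.na(observed)] <- "FN"
  # guard with !is.na() so the index never contains NA
  same <- !is.na(expected) & !is.na(observed) & expected == observed
  out[same] <- "TP"
  out
}

# Assumption: rows of both data.frames are already aligned by MapSection
cols <- setdiff(names(true_map), "MapSection")
result <- cbind(
  true_map["MapSection"],
  as.data.frame(
    Map(classify_cells, true_map[cols], obsrvd_map[cols]),
    stringsAsFactors = FALSE
  )
)
print(result)
```

If the row order might differ between the two tables, merging on MapSection first (as in the snippet above) would be the safer route.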
Upvotes: 0