Reputation: 2225
I am new in r and this may sound like a silly question but I was trying to join two tables by country
but got nan
in some as their names do not match. (data from covid19.analytics & world bank data - WDI)
Now I have created a mapping table for country names that didn't match with country names from both table.
How can I join them to replace nan for the mis-matched country name rows ?
Data sources:
library(tidyverse)
url_country_stats <- "https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/country_stats.csv"
url__ts <- "https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/ts_all_long.csv"
country_stats <- read.csv(url(url_country_stats))
ts_all_long <- read.csv(url(url_ts))
joining tables:
ts_all_long <- ts_all_long %>%
left_join(y = country_stats %>%
select(-c("year","longitude","latitude","lending","capital")),
by = c("Country.Region" = "country")) %>%
relocate(c("Confirmed_daily","Recovered_daily","Death_daily","Case_Fatality_Ratio_prcnt","Confirmed","Recovered","Death"),
.after = income)
ts_all_long
Countries names that got mismatched/ gives na
:
ts_all_long %>%
filter(is.na(iso2c),
date == max(date)) %>%
pull(Country.Region)
[1] "Burma" "Congo (Kinshasa)"
[3] "Czechia" "Egypt"
[5] "Iran" "Korea, South"
[7] "Kyrgyzstan" "Russia"
[9] "Saint Vincent and the Grenadines" "Slovakia"
[11] "Taiwan*" "US"
[13] "Venezuela" "Yemen"
Mapping table for non matching country names from both table:
mapping_tbl <- data.frame(tbl_ts_all_long = c("Burma","Congo (Kinshasa)","Czechia","Egypt","Iran","Korea, South",
"Kyrgyzstan","Russia","Saint Vincent and the Grenadines","Slovakia","Taiwan*",
"US","Venezuela","Yemen"),
tbl_country_stats = c("Myanmar","Congo, Dem. Rep.","Czech Republic","Egypt, Arab Rep.","Iran, Islamic Rep.","Korea, Rep.",
"Kyrgyz Republic","Russian Federation","St. Vincent and the Grenadines","Slovak Republic","Taiwan Not Available","United States","Venezuela, RB","Yemen, Rep.")
)
mapping_tbl
Now how can I join them to get values for above mentioned countries that have nan in their rows ?
Upvotes: 1
Views: 194
Reputation: 886948
We could do a join on the 'mapping_tbl' first, coalesce
the 'Country.Region' and the 'tbl_ts_all_long', then use the OP's code
out <- ts_all_long %>%
left_join(mapping_tbl, by = c("Country.Region" = "tbl_ts_all_long")) %>%
mutate(Country.Region = coalesce(tbl_country_stats, Country.Region)) %>%
left_join(y = country_stats %>%
select(-c("year","longitude","latitude","lending","capital")),
by = c("Country.Region" = "country")) %>%
relocate(c("Confirmed_daily","Recovered_daily","Death_daily",
"Case_Fatality_Ratio_prcnt","Confirmed","Recovered","Death"),
.after = income)
By checking the output
out %>%
filter(is.na(iso2c),
date == max(date)) %>%
pull(Country.Region)
#[1] "Namibia" "Taiwan Not Available"
the number of mismatches got reduced and the ones that are not matching should also be included in the 'mapping_tbl'
Upvotes: 1