Reputation: 4200
I have a data.frame
containing countries and cities in the column location
, and I want to extract the former by matching with the world.cities$country.etc
dataframe from library(maps)
(or any other collection of country names).
Consider this example:
df <- data.frame(location = c("Aarup, Denmark",
"Switzerland",
"Estonia: Aaspere"),
other_col = c(2,3,4))
I attempted using this code
df %>% extract(location,
into = c("country", "rest_location"),
remove = FALSE,
function(x) x[which x %in% world.cities$country.etc])
But am not successful; I expect something like this:
location other_col country rest_location
1 Aarup, Denmark 2 Denmark Aarup,
2 Switzerland 3 Switzerland
3 Estonia: Aaspere 4 Estonia : Aaspere
Upvotes: 2
Views: 2191
Reputation: 17648
you can try this as a starting point
library(tidyverse)
df %>%
rownames_to_column() %>%
separate_rows(location) %>%
mutate(gr = location %in% world.cities$country.etc) %>%
mutate(gr = ifelse(gr, "country", "rest_location")) %>%
spread(gr, location) %>%
right_join(df %>%
rownames_to_column(),
by = c("rowname", "other_col")) %>%
select(location, other_col, country, rest_location)
location other_col country rest_location
1 Aarup, Denmark 2 Denmark Aarup
2 Switzerland 3 Switzerland <NA>
3 Estonia: Aaspere 4 Estonia Aaspere
Of note, this only works if there are only two "words" in the location column. If necessary you have to specify a suitable separate e.g. sep=",|:"
Upvotes: 4
Reputation: 5798
Base R (not including maps package):
# Import the library:
library(maps)
# Split the string on the spaces:
country_city_vec <- strsplit(df$location, "\\s+")
# Replicate the other col's rows by the split string vec:
rolled_out_df <- data.frame(other_col = rep(df$other_col, sapply(country_city_vec, length)),
location = gsub("[[:punct:]]", "", unlist(country_city_vec)), stringsAsFactors = F)
# Match with the world df:
matched_with_world_df <- merge(df,
setNames(rolled_out_df[rolled_out_df$location %in% world.cities$country.etc,],
c("other_col", "country")),
by = "other_col", all.x = T)
# Extract the city/location drilldown:
matched_with_world_df$rest_location <- trimws(gsub("[[:punct:]]",
"",
gsub(paste0(matched_with_world_df$country,
collapse = "|"),
"", matched_with_world_df$location)), "both")
Upvotes: 1
Reputation: 389175
We can create a pattern of all country names by pasting them together and use str_extract_all
to get all the country names which match the pattern in location
and remove the words which match the country names to get rest_location
.
library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$country <- sapply(str_extract_all(df$location, all_countries), toString)
df$rest_location <- str_remove_all(df$location, all_countries)
#OR can also do
#df$rest_location <- str_remove_all(df$location, df$country)
df
# location other_col country rest_location
#1 Aarup, Denmark 2 Denmark Aarup,
#2 Switzerland 3 Switzerland
#3 Estonia: Aaspere 4 Estonia : Aaspere
Using sapply
and toString
for country
because if there are more than one country names in location
they all are concatenated in one string.
Upvotes: 5