Ivo
Ivo

Reputation: 4200

extract country names (or other entity) from column

I have a data.frame containing countries and cities in the column location, and I want to extract the former by matching with the world.cities$country.etc dataframe from library(maps) (or any other collection of country names).

Consider this example:

df <- data.frame(location = c("Aarup, Denmark",
                              "Switzerland",
                              "Estonia: Aaspere"),
                 other_col = c(2,3,4))

I attempted using this code

df %>% extract(location,
               into = c("country", "rest_location"),
               remove = FALSE,
               function(x) x[which x %in% world.cities$country.etc])

But am not successful; I expect something like this:

          location other_col     country rest_location
1   Aarup, Denmark         2     Denmark       Aarup, 
2      Switzerland         3 Switzerland              
3 Estonia: Aaspere         4     Estonia     : Aaspere

Upvotes: 2

Views: 2191

Answers (3)

Roman
Roman

Reputation: 17648

you can try this as a starting point

library(tidyverse)
df %>% 
  rownames_to_column() %>% 
  separate_rows(location) %>% 
  mutate(gr = location %in% world.cities$country.etc) %>% 
  mutate(gr = ifelse(gr, "country", "rest_location")) %>% 
  spread(gr, location) %>% 
  right_join(df %>% 
              rownames_to_column(), 
              by = c("rowname", "other_col")) %>% 
  select(location, other_col, country, rest_location)
          location other_col     country rest_location
1   Aarup, Denmark         2     Denmark         Aarup
2      Switzerland         3 Switzerland          <NA>
3 Estonia: Aaspere         4     Estonia       Aaspere

Of note, this only works if there are only two "words" in the location column. If necessary you have to specify a suitable separate e.g. sep=",|:"

Upvotes: 4

hello_friend
hello_friend

Reputation: 5798

Base R (not including maps package):

# Import the library: 

library(maps)

# Split the string on the spaces: 

country_city_vec <- strsplit(df$location, "\\s+")

# Replicate the other col's rows by the split string vec: 

rolled_out_df <- data.frame(other_col = rep(df$other_col, sapply(country_city_vec, length)), 

                            location = gsub("[[:punct:]]", "", unlist(country_city_vec)), stringsAsFactors = F)

# Match with the world df: 

matched_with_world_df <- merge(df,

                               setNames(rolled_out_df[rolled_out_df$location %in% world.cities$country.etc,],
                                        c("other_col", "country")),

                               by = "other_col", all.x = T)

# Extract the city/location drilldown: 

matched_with_world_df$rest_location <- trimws(gsub("[[:punct:]]",
                                                   "",
                                                   gsub(paste0(matched_with_world_df$country,
                                                               collapse = "|"),
                                           "", matched_with_world_df$location)), "both")

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389175

We can create a pattern of all country names by pasting them together and use str_extract_all to get all the country names which match the pattern in location and remove the words which match the country names to get rest_location.

library(maps)
library(stringr)

all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$country <- sapply(str_extract_all(df$location, all_countries), toString)
df$rest_location <- str_remove_all(df$location, all_countries)
#OR can also do
#df$rest_location <- str_remove_all(df$location, df$country)

df
#          location other_col     country rest_location
#1   Aarup, Denmark         2     Denmark       Aarup, 
#2      Switzerland         3 Switzerland              
#3 Estonia: Aaspere         4     Estonia     : Aaspere

Using sapply and toString for country because if there are more than one country names in location they all are concatenated in one string.

Upvotes: 5

Related Questions