Reputation: 25
I am trying to count the total number of country mentions in a long string. I have loaded a data frame named "countries" with a column "Countries" that contains all countries in the world. I want to write a function that takes any string, loops over all the country names in my data frame and returns the sum of occurrences of every country name (i.e. the total number of country mentions).
Code:
number.of.countries <- function(str){
  # Initialize
  countcountry <- 0
  # Loop over all countries:
  for (i in countries$Countries){
    # Logical test:
    countries_mentioned <- grepl(i, str, perl = T, ignore.case = T)
    # Add to the count
    if (isTRUE(countries_mentioned)){
      countcountry <- countcountry + str_count(str, fixed(countries$Countries[i], ignore_case = TRUE))
    }
  }
  # Output
  return(countcountry)
}
###When running the function:
> number.of.countries(str)
[1] NA
Upvotes: 2
Views: 53
Reputation: 4187
I assume you have multiple strings that you want to check for countries. In that case you could do the following:
# example data
longstring <- c("The countries austria and Albania are in Europe, while Australia is not. Austria is the richest of the two European countries.",
"In this second sentence we stress the fact that Australia is part of Australia.")
countries <- c("Austria","Albania","Australia","Azerbeyan")
With lapply and stri_count_fixed from the stringi package (in which you can specify what to do with case sensitivity) you can get the counts for each country:
library(stringi)
l <- lapply(longstring, stri_count_fixed, pattern = countries, case_insensitive = TRUE)
The result:
[[1]]
[1] 2 1 1 0
[[2]]
[1] 0 0 2 0
Now you can transform that into a data frame with:
countdf <- setNames(do.call(rbind.data.frame, l), countries)
countdf$total <- rowSums(countdf)
The final result:
> countdf
  Austria Albania Australia Azerbeyan total
1       2       1         1         0     4
2       0       0         2         0     2
NOTE: To demonstrate how case_insensitive = TRUE works, I started the first appearance of "Austria" in longstring with a lowercase a.
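If only the per-string total is needed and installing stringi is not an option, the same counting idea can be sketched in base R. The helper name count_mentions is hypothetical, and lower-casing both sides with tolower() is an assumption that stands in for case_insensitive = TRUE; it should behave the same for plain ASCII names but may differ for other alphabets.

```r
# Hypothetical helper, not part of stringi: total number of literal,
# case-insensitive dictionary matches in each element of `text`.
count_mentions <- function(text, dictionary) {
  text_lower <- tolower(text)
  counts <- matrix(0L, nrow = length(text), ncol = length(dictionary))
  for (j in seq_along(dictionary)) {
    # fixed = TRUE treats the country name as a literal string, not a regex
    hits <- gregexpr(tolower(dictionary[j]), text_lower, fixed = TRUE)
    # gregexpr() reports -1 when nothing matched, so count positive positions only
    counts[, j] <- vapply(hits, function(h) sum(h > 0), integer(1))
  }
  rowSums(counts)  # one total per input string
}

# Same example data as in the answer
longstring <- c("The countries austria and Albania are in Europe, while Australia is not. Austria is the richest of the two European countries.",
                "In this second sentence we stress the fact that Australia is part of Australia.")
countries <- c("Austria", "Albania", "Australia", "Azerbeyan")

totals <- count_mentions(longstring, countries)
print(totals)
```

For the example data this should give 4 and 2, the same values as the total column. Because the match is a plain substring search, a name that is contained in another name would be counted inside the longer one as well.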
Upvotes: 0
Reputation: 783
You can vectorise your function to make the code shorter and faster. An example would be:
library(stringr)
number.countries <- function(str,dictionary){
return(sum(str_count(str,dictionary)))
}
number.countries("England and Ireland, oh and also Wales", c("Wales","Ireland","England"))
[1] 3
which can be passed a custom dictionary (in your case countries$Countries).
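As for the NA in the question, it most likely comes from countries$Countries[i]: the loop variable i already holds a country name, so using it as an index looks for an element with that name, and an unnamed vector has none. A minimal base R sketch of the effect, with a small made-up data frame standing in for the real one:

```r
# Made-up stand-in for the data frame from the question
countries <- data.frame(Countries = c("Wales", "Ireland", "England"),
                        stringsAsFactors = FALSE)

for (i in countries$Countries) {
  # i is already the name itself ("Wales", "Ireland", ...), not a position
  by_name <- countries$Countries[i]
  print(by_name)  # NA each time: no element of the vector is named like that
}

# Looping over positions makes the index valid again
for (k in seq_along(countries$Countries)) {
  print(countries$Countries[k])
}
```

Adding NA to the running count turns the whole sum into NA, so in the original loop it should be enough to use i itself as the pattern, or to loop over seq_along(countries$Countries).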
Upvotes: 1