Reputation: 21
I have a data frame with a column of politician names extracted from thousands of news articles. Each row is one article. I want to count which politicians are mentioned most often, counting each name only once per article (row).
A named-entity recognition algorithm returned these results. Now I have to convert the names to a standard form so that I can summarise and compare them.
Because I am interested in at most 20 people, I know the names in advance and thought that, despite the effort, manually coding the patterns for each name might be the fastest way (I am happy to hear other ideas).
#example data
library(dplyr)
library(stringr)
persons <- c("Merkel,Angela Merkel,Trump,Ursula,Merkels", "Ursula von,Trumps,Donald Trump,Leyen")
df <- data.frame(persons)
#change pattern: pad each name with spaces
df <- df %>%
  mutate(
    persons = paste0(" ", str_replace_all(persons, ",", " , "))
  )
#example of extracting the names... and so on, you get the idea
str_replace_all(df$persons, c(" Trump(s)?" = "Donald Trump", ", Trump(s)?" = ", Donald Trump", "Donald Trumps" = "Donald Trump",
                              " Merkel(s)?" = "Angela Merkel"))
My desired output is to have just the full names in each row. In the end I would remove the duplicated names per row, and then I could count the names as desired.
The data should look like this in the end:
persons <- c("Angela Merkel,Angela Merkel,Donald Trump,Ursula von der Leyen,Angela Merkel", "Ursula von der Leyen,Donald Trump,Donald Trump,Ursula von der Leyen")
I have an especially hard time with patterns for names that consist of more than two parts, like Ursula von der Leyen. What would be the best way to convert the names, and what would the replacement patterns look like?
Edit: I have now written a function that makes sure there is only one instance of each name per row in my data frame. It is not really elegant code, but it works.
clean_name <- function(x) {
  b <- unlist(strsplit(x, '[,]')) %>%
    str_squish(.)
  c <- b[!duplicated(b)]
  #vectors for full names (forename and surname) and for surnames only
  ganzer_name <- vector()
  nachname <- vector()
  for (person in c) {
    if (any(str_count(person, " ") == 0)) {
      nachname <- append(nachname, person)
    } else {
      ganzer_name <- append(ganzer_name, person)
    }
  }
  #checks for constructions with a trailing s, like Angela Merkels
  ganzer_name <- ganzer_name %>%
    str_sort() %>% str_replace(., "\\+", "")
  i <- 0
  aussortieren <- c("", "")
  while (i < (length(ganzer_name) - 1)) {
    i <- i + 1
    if (str_detect(ganzer_name[i + 1],
                   paste0(ganzer_name[i], "*"))) {
      aussortieren <- append(aussortieren, ganzer_name[i + 1])
    }
  }
  ganzer_name <- ganzer_name[!ganzer_name %in% aussortieren]
  #check whether the surname already occurs in a full name
  for (person in nachname) {
    #drop the last character to cover constructions with s, like Merkels
    if (!any(str_detect(ganzer_name, paste0(
      "\\Q",
      str_sub(person, end = nchar(person) - 1),
      "\\E")))) {
      ganzer_name <- append(ganzer_name, person)
    }
  }
  return(paste(ganzer_name, collapse = ","))
}
Upvotes: 0
Views: 112
Reputation: 389175
Instead of converting the various forms of each name into a standard format, removing duplicates and then counting, here is a different approach.
We can use grepl for pattern matching and count the number of news articles in which each politician occurs.
name <- c('Trump', 'Merkel')
sapply(name, function(x) sum(grepl(x, df$persons)))
# Trump Merkel
# 2 1
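The same idea can be extended to names with several parts, such as Ursula von der Leyen, by giving each politician one regular expression that covers the fragments the entity recognizer produces. This is only a sketch: the `patterns` lookup is hypothetical, and it assumes that the fragments "Ursula" and "Leyen" never refer to anyone else in the articles.

```r
# Example data from the question
persons <- c("Merkel,Angela Merkel,Trump,Ursula,Merkels",
             "Ursula von,Trumps,Donald Trump,Leyen")
df <- data.frame(persons)

# Hypothetical lookup: standard name -> regex matching any of its fragments
# (assumption: "Ursula" or "Leyen" alone always means Ursula von der Leyen)
patterns <- c("Angela Merkel"        = "Merkel",
              "Donald Trump"         = "Trump",
              "Ursula von der Leyen" = "Ursula|Leyen")

# grepl() gives one TRUE/FALSE per article, so a name counts at most once per row
counts <- sapply(patterns, function(p) sum(grepl(p, df$persons)))
counts
```

Because the result is named by the standard form, no replacement step is needed before counting.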
Use ignore.case = TRUE in grepl if you want the comparison to be case-insensitive.
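If the standardised names per article are also wanted, the logical matches can be kept as a matrix instead of being summed. The following is a rough sketch with made-up mixed-case data and a hypothetical `patterns` lookup; it assumes at least two articles, since `sapply` returns a plain vector rather than a matrix for a single row.

```r
# Made-up example rows with inconsistent capitalisation
persons <- c("merkel,Angela Merkel,TRUMP", "Ursula von,Leyen")

# Hypothetical lookup: standard name -> regex for its fragments
patterns <- c("Angela Merkel"        = "Merkel",
              "Donald Trump"         = "Trump",
              "Ursula von der Leyen" = "Ursula|Leyen")

# Rows are articles, columns are politicians; matching ignores case
hits <- sapply(patterns, function(p) grepl(p, persons, ignore.case = TRUE))

# One de-duplicated, standardised name list per article
standardised <- apply(hits, 1, function(row) paste(names(patterns)[row], collapse = ","))
standardised

# Number of articles mentioning each politician
colSums(hits)
```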
Upvotes: 2