Lee

Reputation: 21

Finding and Replacing parts of person names

I have a dataframe with a column of politician names that were extracted from thousands of news articles. Each row is a specific article. I want to count which politicians are mentioned the most, but count each name only once per article (row).

An entity recognition algorithm returned these results. Now I have to convert the names to a standard form to be able to summarise and compare them.

Because I am interested in at most 20 people, I know the names, and I thought that, despite the effort, manually coding the patterns for each name might be the fastest way (I am happy to hear other ideas).

# Example data
library(dplyr)
library(stringr)

persons <- c("Merkel,Angela Merkel,Trump,Ursula,Merkels", "Ursula von,Trumps,Donald Trump,Leyen")
df <- data.frame(persons)

# Pad the commas with spaces and prefix each row with two spaces
df <- df %>%
  mutate(persons = paste0("  ", str_replace_all(persons, ",", " , ")))

# Example of replacing the names... and so on, you get the idea
str_replace_all(df$persons, c("   Trump(s)?" = "Donald Trump", ", Trump(s)?" = ", Donald Trump", "Donald Trumps" = "Donald Trump",
                              "     Merkel(s)?" = "Angela Merkel"))

My desired output has just the full names in each row. In the end I would remove the duplicated names per row, and then I could count the dataset as desired.

The data should look like this in the end:

persons <- c("Angela Merkel,Angela Merkel,Donald Trump,Ursula von der Leyen,Angela Merkel", "Ursula von der Leyen,Donald Trump,Donald Trump,Ursula von der Leyen")

I have an especially hard time with patterns for names that consist of more than two parts, like Ursula von der Leyen. What would be the best way to convert the names, and what would the replacement pattern look like?

Edit: I have now written a function which makes sure that there is only one instance of each name per row in my dataframe. It is not really elegant or nice code, but it works.

clean_name <- function(x) {
  # Split the row on commas, trim whitespace, and drop exact duplicates
  b <- unlist(strsplit(x, '[,]')) %>%
    str_squish(.)
  c <- b[!duplicated(b)]

  # Separate full names (containing a space) from single-word names
  ganzer_name <- vector()
  nachname <- vector()
  for (person in c) {
    if (any(str_count(person, " ") == 0)) {
      nachname <- append(nachname, person)
    } else {
      ganzer_name <- append(ganzer_name, person)
    }
  }

  # Check for forms with a trailing "s", such as "Angela Merkels"
  ganzer_name <- ganzer_name %>%
    str_sort() %>% str_replace(., "\\+", "")
  i <- 0
  aussortieren <- c("", "")

  # After sorting, a name that extends its predecessor is marked for removal
  while (i < (length(ganzer_name) - 1)) {
    i <- i + 1
    if (str_detect(ganzer_name[i + 1],
                   paste0(ganzer_name[i], "*"))) {
      aussortieren <- append(aussortieren, ganzer_name[i + 1])
    }
  }

  ganzer_name <- ganzer_name[!ganzer_name %in% aussortieren]

  # Keep a single-word name only if it is not already part of a full name
  for (person in nachname) {
    # Drop the last character so that forms like "Merkels" still match "Merkel"
    if (!any(str_detect(ganzer_name, paste0(
      "\\Q",
      str_sub(person, end = nchar(person) - 1),
      "\\E")))) {
      ganzer_name <- append(ganzer_name, person)
    }
  }
  return(paste(ganzer_name, collapse = ","))
}

Upvotes: 0

Views: 112

Answers (1)

Ronak Shah

Reputation: 389175

Instead of converting the various combinations of names into a standard format, removing duplicates, and then counting, here is a different approach.

We can use grepl for pattern matching and count in how many news articles each politician occurs.

name <- c('Trump', 'Merkel')
sapply(name, function(x) sum(grepl(x, df$persons)))

# Trump Merkel 
#     2      1 
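
The count is per article because grepl returns one TRUE/FALSE per row, no matter how often the name appears in that row. A minimal sketch on the example data from the question (the variable `hits` is only introduced here for illustration):

```r
persons <- c("Merkel,Angela Merkel,Trump,Ursula,Merkels",
             "Ursula von,Trumps,Donald Trump,Leyen")
df <- data.frame(persons)

# One logical per article: TRUE if the pattern occurs at least once in the row
hits <- grepl("Merkel", df$persons)
hits
# [1]  TRUE FALSE

# Summing the logicals counts articles, not individual mentions
sum(hits)
# [1] 1
```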

Use ignore.case = TRUE in grepl if you want to make the comparison case insensitive.
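
For names with more than two parts, such as Ursula von der Leyen, the same idea can probably be extended with one regular expression per politician. The following is only a sketch under assumptions: the `patterns` vector and its alternations are hypothetical and would have to be adapted to the fragments the entity recognition actually returns.

```r
persons <- c("Merkel,Angela Merkel,Trump,Ursula,Merkels",
             "Ursula von,Trumps,Donald Trump,Leyen")
df <- data.frame(persons)

# Hypothetical lookup: canonical name = regex covering the fragments seen in the data
patterns <- c(
  "Angela Merkel"        = "merkel",
  "Donald Trump"         = "trump",
  "Ursula von der Leyen" = "ursula|leyen"
)

# Number of articles (rows) in which each politician is matched at least once,
# ignoring case
counts <- sapply(patterns, function(p) sum(grepl(p, df$persons, ignore.case = TRUE)))
counts
# Angela Merkel: 1, Donald Trump: 2, Ursula von der Leyen: 2
```

A loose pattern like "ursula" would also match other people with that first name, so with real data a stricter alternation may be needed.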

Upvotes: 2
