mtotof

Reputation: 69

Eliminating duplicate values based on criteria in R

I am cleaning an administrative dataset. I have two questions, and I am not sure in which order to tackle them in my code. I have data similar to this:

name <- c("Smith, John J", "Smith, John Jay", "Smith, Jane", "Smith, Joe M")
event <- c('123', '123', '124', '125')
type <- c('s', 'a', 'v', 's')
df <- data.frame(name, event, type)

First, I want to eliminate the extra event record for individuals who have both a type = 'a' and a type = 's' record. When they have both, I want to keep only the type = 'a' record. Note: this applies only when the event is the same. As you can see in df, there are two event 123 records for John Smith.

Second, is it a problem that some individuals have their middle name spelled out, some have only an initial, and some have no middle name or initial in the name field? If so, I was planning on separating that column out with:

separate(df, name, c('name','middle'), " ")

Ideally, my end goal would look like this:

           name event type
1 Smith, John J   123    a
2   Smith, Jane   124    v
3  Smith, Joe M   125    s

Upvotes: 1

Views: 59

Answers (2)

akrun

Reputation: 887028

Group by 'event' and by 'grp', which is created by removing the last word of 'name'. Within each group, reorder 'name' so that the name from the type 'a' record comes first, then slice the last row of the group:

library(dplyr)
library(stringr)
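df %>%
    # Group by event and by the name with its last word removed
    group_by(event, grp = str_remove(name, "\\s*\\w+$")) %>%
    # Within each group, move the name from the 'a' record to the front
    mutate(name = name[order(type != 'a')]) %>%
    # Keep the last row of each group
    slice(n()) %>%
    ungroup() %>%
    select(-grp)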
# A tibble: 3 x 3
#  name          event type 
#  <fct>         <fct> <fct>
#1 Smith, John J 123   a    
#2 Smith, Jane   124   v    
#3 Smith, Joe M  125   s    
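
Note that slice(n()) picks a row by position while only the 'name' column is reordered, so the pipeline above returns the type 'a' record only when that record is the last row of its group, as it is in the sample data. A minimal sketch of an order-independent variant follows. It is an illustration rather than the method above, and it assumes it is acceptable to keep the 'a' row exactly as recorded, so the name comes out as "Smith, John Jay" rather than "Smith, John J". The reversed data frame df_rev is a hypothetical input used only to show that row order no longer matters.

```r
library(dplyr)
library(stringr)

name  <- c("Smith, John J", "Smith, John Jay", "Smith, Jane", "Smith, Joe M")
event <- c("123", "123", "124", "125")
type  <- c("s", "a", "v", "s")
df <- data.frame(name, event, type, stringsAsFactors = FALSE)

# Hypothetical input: same rows in reverse order, so the 'a' record
# is now the first row of its group instead of the last
df_rev <- df[rev(seq_len(nrow(df))), ]

result <- df_rev %>%
  # Same grouping as above: event plus the name minus its last word
  group_by(event, grp = str_remove(name, "\\s*\\w+$")) %>%
  # Sort whole rows within each group so that 'a' records come first
  arrange(type != "a", .by_group = TRUE) %>%
  # Keep the first row of each group
  slice(1) %>%
  ungroup() %>%
  select(-grp)

print(result)
```

Because whole rows are sorted rather than a single column, the name, event and type in each result row always come from the same original record.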

Upvotes: 1

Marius

Reputation: 60060

To do this while accounting for the possible inconsistencies in middle names, you'll have to split the name into its different components:

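library(dplyr)
library(tidyr)

# Keep strings as character vectors rather than factors
# (data.frame() converted strings to factors by default before R 4.0.0)
df <- data.frame(name, event, type, stringsAsFactors = FALSE)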

df %>%
  separate(name, c("last", "other"), sep = ", ", remove = FALSE) %>%
  separate(other, c("first", "middle"), sep = " ", fill = "right") %>%
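  # Treat people as the same individual if their first and last names
  #   match, ignoring the middle name
  group_by(first, last, event) %>%
  # Put 'a' records first (arrange() ignores the grouping by default)
  arrange(desc(type == "a")) %>%
  # Keep the first record for each person and event
  slice(1)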
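
To see how the two separate() calls handle the different middle-name forms, here is a small sketch that runs only the splitting step on the sample names. The data frame name people is introduced here purely for illustration.

```r
library(dplyr)
library(tidyr)

# Illustrative data frame holding only the sample names
people <- data.frame(
  name = c("Smith, John J", "Smith, John Jay", "Smith, Jane", "Smith, Joe M"),
  stringsAsFactors = FALSE
)

parts <- people %>%
  # "Smith, John J" -> last = "Smith", other = "John J"
  separate(name, c("last", "other"), sep = ", ", remove = FALSE) %>%
  # "John J" -> first = "John", middle = "J"; a missing middle name becomes NA
  separate(other, c("first", "middle"), sep = " ", fill = "right")

parts$first   # "John" "John" "Jane" "Joe"
parts$middle  # "J"    "Jay"  NA     "M"
```

Both John Smith rows get the same first and last values, which is what lets the grouping treat them as one person. Names with more than one word after the first name would need extra handling, for example through the extra argument of separate().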

Upvotes: 1
