Reputation: 69
I am cleaning an administrative dataset. I have two questions, and I am not sure of the logical order to answer them for my coding purposes. I have data similar to this:
name <- c("Smith, John J", "Smith, John Jay", "Smith, Jane", "Smith, Joe M")
event <- c('123', '123', '124', '125')
type <- c('s', 'a', 'v', 's')
df <- data.frame(name, event, type)
First, I want to eliminate the extra event record for individuals that have both type = 'a' and type = 's' records. When they have both, I only want to keep the type = 'a' observation. Note: this is conditional on the event record being the same. As you can see in df, there are two event 123 records for "John Smith".
Second, is it a problem that some individuals have their middle name spelled out, others have only an initial, and others have no middle name or initial at all in the name field? If so, I was planning on separating that column out with:
separate(df, name, c('name','middle'), " ")
Ideally, my end goal would look like this:
name event type
1 Smith, John J 123 a
2 Smith, Jane 124 v
3 Smith, Joe M 125 s
Upvotes: 1
Views: 59
Reputation: 887028
Group by 'event' and by a 'grp' column created by removing the last word from 'name'. Then, within each group, reorder 'name' according to where 'a' occurs in 'type', and slice the last row:
library(dplyr)
library(stringr)
df %>%
  group_by(event, grp = str_remove(name, "\\s*\\w+$")) %>%
  mutate(name = name[order(type != 'a')]) %>%
  slice(n()) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 3 x 3
# Groups: event [3]
# name event type
# <fct> <fct> <fct>
#1 Smith, John J 123 a
#2 Smith, Jane 124 v
#3 Smith, Joe M 125 s
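Note that the pipeline above reorders only the name column, so which type survives depends on the row order of the input. A minimal sketch of a variant that sorts whole rows instead is shown below, assuming the same sample data as the question. The `res` variable is introduced here only for illustration. Unlike the output above, this variant keeps the name exactly as it appears on the 'a' record ("Smith, John Jay").

```r
library(dplyr)
library(stringr)

# Sample data from the question, kept as character rather than factor
name  <- c("Smith, John J", "Smith, John Jay", "Smith, Jane", "Smith, Joe M")
event <- c("123", "123", "124", "125")
type  <- c("s", "a", "v", "s")
df <- data.frame(name, event, type, stringsAsFactors = FALSE)

res <- df %>%
  # grp drops the last word of the name, as in the pipeline above
  group_by(event, grp = str_remove(name, "\\s*\\w+$")) %>%
  # sort whole rows so that any 'a' record comes first within its group
  arrange(type != "a", .by_group = TRUE) %>%
  # keep the first row of each group
  slice(1) %>%
  ungroup() %>%
  select(-grp)
```

Because the whole row moves, name and type always come from the same record. One caveat of the 'grp' approach in general: a name with no middle part, such as "Smith, Jane", is reduced to "Smith," and would share a group with any other single-name Smith in the same event.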
Upvotes: 1
Reputation: 60060
To do this while accounting for the possible inconsistencies in middle names, you'll have to split the name into its different components:
library(dplyr)
library(tidyr)
# Be careful about the stringsAsFactors setting
df <- data.frame(name, event, type, stringsAsFactors = FALSE)
df %>%
  # Split "Last, First Middle" into last name and the rest
  separate(name, c("last", "other"), sep = ", ", remove = FALSE) %>%
  # Split the rest into first and middle; a missing middle becomes NA
  separate(other, c("first", "middle"), sep = " ", fill = "right") %>%
  # Treat people as the same individual if their first and last names
  # match, ignore middle name
  group_by(first, last, event) %>%
  # Put 'a' records first
  arrange(desc(type == "a")) %>%
  slice(1)
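For a self-contained run on the question's sample data, the pipeline above can be finished with ungroup() and select() so that only the original three columns remain. This is a sketch under the same assumption as above (first name, last name and event identify an individual). The `res` variable is introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, kept as character rather than factor
df <- data.frame(
  name  = c("Smith, John J", "Smith, John Jay", "Smith, Jane", "Smith, Joe M"),
  event = c("123", "123", "124", "125"),
  type  = c("s", "a", "v", "s"),
  stringsAsFactors = FALSE
)

res <- df %>%
  separate(name, c("last", "other"), sep = ", ", remove = FALSE) %>%
  separate(other, c("first", "middle"), sep = " ", fill = "right") %>%
  group_by(first, last, event) %>%
  # 'a' records sort to the top, so slice(1) prefers them within each group
  arrange(desc(type == "a")) %>%
  slice(1) %>%
  # drop the grouping and the helper columns created by separate()
  ungroup() %>%
  select(name, event, type)
```

The row kept for event 123 is the 'a' record, so its name is the spelling used on that record ("Smith, John Jay"). The result is ordered by the grouping columns rather than by the original row order.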
Upvotes: 1