Mike

Reputation: 2097

Creating Groups with Dplyr's "group_by" then Using Stringr to Find Differences Between Groups

Using the example below, I want to group the dataframe by CaseWorker, then Client, then determine for each Client group whether the list of tasks in "Task" is the same as the list of tasks in "Task2".

I would be happy with a simple TRUE or FALSE or, better yet, with each task that is in "Task2" but not in "Task" extracted and displayed in a new column or dataframe.

So basically I need to make sure "Task" and "Task2" contain the same entries for each individual Client.

I would like to stick with Dplyr and Stringr if possible, or at least stay within the Tidyverse. I'm thinking there's some way of using "group_by" and "str_detect" or some other Stringr functionality to achieve this in an elegant manner.

CaseWorker<-c("John","John","John","John","John","John","Melanie","Melanie","Melanie","Melanie","Melanie","Melanie")
Client<-c("Chris","Chris","Chris","Tom","Tom","Tom","Valerie","Valerie","Valerie","Tim","Tim","Tim")
Task<-c("Feed cat","Make dinner","Iron shirt","Make dinner","Do homework","Make lunch","Make dinner","Feed cat","Buy groceries","Do homework","Iron shirt","Make lunch")
Task2<-c("Feed cat","Make dinner","Iron shirt","Make dinner","Do homework","Feed cat","Make dinner","Feed cat","Iron shirt","Do homework","Iron shirt","Make lunch")
Df<-data.frame(CaseWorker,Client,Task,Task2)

Upvotes: 0

Views: 321

Answers (4)

Carl Boneri

Reputation: 2722

This might just be me misinterpreting the question, but I think you might be over-complicating this in the event that what you want is simply the records where Task does not match Task2.

> Df[which(Df$Task != Df$Task2),]

  CaseWorker  Client          Task      Task2
6       John     Tom    Make lunch   Feed cat
9    Melanie Valerie Buy groceries Iron shirt

Upvotes: 0

karthikbharadwaj

Reputation: 368

If you would like to use the stringr package, the following could also work for you:

library(dplyr)
library(stringr)

# Row by row: does Task contain the text of Task2?
Df %>% 
     group_by(CaseWorker,Client) %>% 
     mutate(Check=str_detect(as.character(Task),as.character(Task2)))
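
One caveat worth keeping in mind: str_detect() interprets its second argument as a regular expression and reports a match whenever the pattern occurs anywhere in the string, so it is a containment test rather than an equality test. Below is a minimal sketch of wrapping the pattern in fixed() so that it is compared literally. The vectors task and task2, and the "Call (mom)" entry, are made up for illustration and are not part of the question's data.

```r
library(stringr)

# Hypothetical vectors, for illustration only
task  <- c("Feed cat", "Make lunch", "Call (mom)")
task2 <- c("Feed cat", "Feed cat",   "Call (mom)")

# Pattern read as a regular expression: the parentheses form a group,
# so they are not matched as literal characters
regex_check <- str_detect(task, task2)

# Pattern read as literal text
literal_check <- str_detect(task, fixed(task2))
```

If strict equality per row is all that is needed, Task == Task2 is probably the simpler choice.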

Upvotes: 0

Jake Kaupp

Reputation: 8072

You can do this with dplyr alone, using %in%:

Df %>% 
  group_by(CaseWorker,Client) %>% 
  mutate(Check = Task %in% Task2) 

This relies on exact, case-sensitive matching. If that is a concern, you could do the following:

Df %>% 
  group_by(CaseWorker,Client) %>% 
  rowwise() %>% 
  mutate(Check = grepl(Task, Task2, ignore.case = TRUE)) 

Note that rowwise() has to come before mutate(), because grepl() is not vectorized over its pattern argument: given a vector of patterns, it uses only the first one. With rowwise(), each Task is compared only with the Task2 on the same row.
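
If the goal is one answer per Client rather than one per row, a set comparison inside summarise() is one possible route. This is only a sketch: it assumes dplyr 1.0 or later (for the .groups argument) and character columns, and the names group_check, same_tasks and missing_from_Task are made up for illustration.

```r
library(dplyr)

# Data from the question, built with character (not factor) columns
Df <- data.frame(
  CaseWorker = rep(c("John", "Melanie"), each = 6),
  Client     = rep(c("Chris", "Tom", "Valerie", "Tim"), each = 3),
  Task  = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Make lunch",
            "Make dinner", "Feed cat", "Buy groceries",
            "Do homework", "Iron shirt", "Make lunch"),
  Task2 = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Feed cat",
            "Make dinner", "Feed cat", "Iron shirt",
            "Do homework", "Iron shirt", "Make lunch"),
  stringsAsFactors = FALSE
)

group_check <- Df %>%
  group_by(CaseWorker, Client) %>%
  summarise(
    # TRUE when both columns hold the same set of tasks, in any order
    same_tasks = base::setequal(Task, Task2),
    # Tasks listed in Task2 but absent from Task, as one string
    missing_from_Task = paste(base::setdiff(Task2, Task), collapse = ", "),
    .groups = "drop"
  )

group_check
```

Unlike a row-by-row comparison, this treats each Client's tasks as a set, so the same tasks listed in a different order still count as a match.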

Upvotes: 1

Daniel Anderson

Reputation: 2424

See if this is what you're after.

First, check whether Task matches Task2 on each row. If it does not, return Task2 in a new variable. I stored the result in a new data frame, df2:

df2 <- Df %>% 
    mutate(match = Task == Task2,
           non_match = ifelse(!match, Task2, "")) 
df2

#    CaseWorker  Client          Task       Task2 match  non_match
# 1        John   Chris      Feed cat    Feed cat  TRUE           
# 2        John   Chris   Make dinner Make dinner  TRUE           
# 3        John   Chris    Iron shirt  Iron shirt  TRUE           
# 4        John     Tom   Make dinner Make dinner  TRUE           
# 5        John     Tom   Do homework Do homework  TRUE           
# 6        John     Tom    Make lunch    Feed cat FALSE   Feed cat
# 7     Melanie Valerie   Make dinner Make dinner  TRUE           
# 8     Melanie Valerie      Feed cat    Feed cat  TRUE           
# 9     Melanie Valerie Buy groceries  Iron shirt FALSE Iron shirt
# 10    Melanie     Tim   Do homework Do homework  TRUE           
# 11    Melanie     Tim    Iron shirt  Iron shirt  TRUE           
# 12    Melanie     Tim    Make lunch  Make lunch  TRUE           

Then summarise the results to see if individual CaseWorker/Client pairs match for all entries.

df2 %>% 
   group_by(CaseWorker, Client) %>% 
   summarise(n = n(),
             matches = sum(match),
             all_match = n == matches)

#   CaseWorker  Client     n matches all_match
#        <chr>   <chr> <int>   <int>     <lgl>
# 1       John   Chris     3       3      TRUE
# 2       John     Tom     3       2     FALSE
# 3    Melanie     Tim     3       3      TRUE
# 4    Melanie Valerie     3       2     FALSE

You could then of course merge this back into your data frame if you needed the all_match variable in your original dataset.
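
A rough sketch of that merge step, assuming a reasonably recent dplyr (1.0 or later, for .groups) and character columns. The names group_flags and Df_flagged are made up for illustration, and the per-group flag is computed here with all() rather than from the n and matches columns above.

```r
library(dplyr)

# Data from the question, built with character (not factor) columns
Df <- data.frame(
  CaseWorker = rep(c("John", "Melanie"), each = 6),
  Client     = rep(c("Chris", "Tom", "Valerie", "Tim"), each = 3),
  Task  = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Make lunch",
            "Make dinner", "Feed cat", "Buy groceries",
            "Do homework", "Iron shirt", "Make lunch"),
  Task2 = c("Feed cat", "Make dinner", "Iron shirt",
            "Make dinner", "Do homework", "Feed cat",
            "Make dinner", "Feed cat", "Iron shirt",
            "Do homework", "Iron shirt", "Make lunch"),
  stringsAsFactors = FALSE
)

# One row per CaseWorker/Client pair: do all rows match?
group_flags <- Df %>%
  group_by(CaseWorker, Client) %>%
  summarise(all_match = all(Task == Task2), .groups = "drop")

# Attach the group-level flag to every original row
Df_flagged <- left_join(Df, group_flags, by = c("CaseWorker", "Client"))

Df_flagged
```

left_join() keeps all twelve original rows in their original order and repeats the all_match value for each row of a given client.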

Upvotes: 2
