Leila
Leila

Reputation: 27

r match data in two data frames then check text in another column for matched row

I have two data frames stu1 and stu2. Both have matching ID columns but different variables in other columns.

For example, stu1:

ID, Grade, Group, Age
ad1, A, Green, 14
bc1, B, Green, 13
cd1, B, Blue, 14
fs3, C, Red, 13

stu2:

ID, Prog, Loc, Year
bc1, LSC1, Ext, 2013
cd1, LSC1, Ext, 2013
cd1, BSC1, Int, 2013
ad1, BSC2, Int, 2012
rs2, KHL4, Ext, 2014

What I'm trying to do is check whether the student ID in stu1 exists in stu2 then check whether the text in another column for the corresponding row matches my string, e.g. Prog =='BSC*' then create a new column in stu1 which states "Yes" or "No".

So, the result for stu1 should be:

ID, Grade, Group, Age, BSCProg
ad1, A, Green, 14, Yes
bc1, B, Green, 13, No
cd1, B, Blue, 14, Yes
fs3, C, Red, 13, No

I've tried a number of different ways unsuccessfully, e.g:

stu1$BSCProg <- ifelse(stu2[grepl("BSC", stu2$Prog) & match(paste0(stu1$ID), 
    paste0(stu1$ID)),], "Yes", "No")

stu1$BSCProg <- ifelse(is.na(match(paste0(stu1$ID),
    paste0(stu2$ID) & stu2[grepl("BSC", stu2$Prog),])),"No","Yes")

stu1$BSCProg <- ifelse(stu1$ID %in% stu2$ID & grepl('BSC', stu2$Prog), "Yes", "No")

Upvotes: 0

Views: 1501

Answers (3)

www
www

Reputation: 39154

A dplyr and tidyr solution. stu3 is the final output.

library(dplyr)
library(tidyr)

stu1 <- data_frame(ID = c("ad1", "bc1", "cd1", "fs3"),
                   Grade = c("A", "B", "B", "C"),
                   Group = c("Green", "Green", "Blue", "Red"),
                   Age = c(14, 13, 14, 13))

stu2 <- data_frame(ID = c("bc1", "cd1", "cd1", "ad1", "rs2"),
                   Prog = c("LSC1", "LSC1", "BSC1", "BSC2", "KHL4"),
                   Loc = c("Ext", "Ext", "Int", "Int", "Ext"),
                   Year = c(2013, 2013, 2013, 2012, 2014))


stu3 <- stu1 %>%
  full_join(stu2 %>% select(ID, Prog), by = "ID") %>%
  mutate(BSCProg = ifelse(grepl("BSC", Prog), "Yes", "No")) %>%
  drop_na(Grade) %>%
  select(-Prog) %>%
  group_by(ID) %>%
  arrange(desc(BSCProg)) %>%
  slice(1)

Upvotes: 0

Eric Watt
Eric Watt

Reputation: 3230

I would do this by merging the two tables so that you can do column comparisons. Using data.table:

library(data.table)

setDT(stu1)
setDT(stu2)

dat <- merge(stu1,
             stu2[Prog %like% "BSC", .(ID, BSCProg = Prog)],
             by = "ID",
             all.x = TRUE)

dat[, BSCProg := ifelse(is.na(BSCProg), "No", "Yes")]

Result:

#     ID Grade Group Age BSCProg
# 1: ad1     A Green  14     Yes
# 2: bc1     B Green  13      No
# 3: cd1     B  Blue  14     Yes
# 4: fs3     C   Red  13      No

Unpack that a bit, the first step is to merge the ID and Prog columns from stu2 into stu1. The Prog %like% "BSC" section will merge only those rows where the Prog column has "BSC" as part of the value. BSCProg = Prog is to rename the column to what you want in the end.

When this is done, the values of the column will be NA or values like BSC1, BSC2. The final statement BSCProg := ifelse(is.na(BSCProg), "No", "Yes") will change any NA to "No" and turn anything else to "Yes".

Upvotes: 4

juan
juan

Reputation: 398

You can merge by ID first, then create your new columns. Here is a data.table solution:

 library(data.table)
 setDT(stu1, key="ID")
 setDT(stu2, key="ID")
 stu1 = merge(stu1, stu2, all.x=TRUE)
 stu1[, BSCProg:=ifelse(grepl("^BSC", Prog), "Yes", "No")]

Upvotes: 1

Related Questions