Danny
Danny

Reputation: 574

Matching data from multiple columns in r

I have two datasets:

Contacts2: This contains a list of ~100,000 contacts, their respective titles and a set of columns which describes the types of work contacts could be involved in. Here's an example dataset:

First<-c("George","Thomas","James","Jimmy","Howard","Herbert")
Last<-c("Washington", "Jefferson", "Madison", "Carter", "Taft", "Hoover")
Title<-c("CEO", "Accountant","Communications Specialist", "President", "Accountant", "CFO")
Finance<-NA
Executive<-NA
Communications<-NA

Contacts2<-as.data.frame(cbind(First,Last,Title,Finance,Executive,Communications))

    First       Last                     Title Finance Executive Communications
1  George Washington                       CEO    <NA>      <NA>           <NA>
2  Thomas  Jefferson                Accountant    <NA>      <NA>           <NA>
3   James    Madison Communications Specialist    <NA>      <NA>           <NA>
4   Jimmy     Carter                 President    <NA>      <NA>           <NA>
5  Howard       Taft                Accountant    <NA>      <NA>           <NA>
6 Herbert     Hoover                       CFO    <NA>      <NA>           <NA>

Note the last three columns are numeric.

TableOfTitle: This dataset contains a list of ~1,000 unique titles and the same set of columns in which describes the type of work the contacts could be involved in. For each title I've put an 1 in the column(s) of the roles that describe that person's job.

Title<-c("CEO","Accountant", "Communications Specialist", "President", "CFO")
Finance<-c(NA,1,NA,1,1)
Executive<-c(1,NA,NA,NA,1)
Communications<-c(NA,NA,1,NA,NA)
TableOfTitle<-as.data.frame(cbind(Title,Finance,Executive,Communications))

                      Title Finance Executive Communications
1                       CEO    <NA>         1           <NA>
2                Accountant       1      <NA>           <NA>
3 Communications Specialist    <NA>      <NA>              1
4                 President       1      <NA>           <NA>
5                       CFO       1         1           <NA>

Note the last three columns are numeric.

I'm now trying to match the check boxes in TableOfTitle in Contacts2 based on the contact title field. For example, since TableOfTitle shows anyone with the title of CFO should have an x in the Finance and Executive field, the record for Herbert Hoover in Contacts2 should also have 1s in those columns as well.

Upvotes: 2

Views: 54

Answers (1)

Marcus Campbell
Marcus Campbell

Reputation: 2796

Here's a solution that uses dplyr. It is essentially what some commenters have already recommended, except that this fulfills the request of not copying over any pre-existing data in the last 3 columns of Contacts2.

Note that ifelse() can be very slow with large datasets, but for your stated task this shouldn't really be noticeable. Algorithmically, this solution is also a bit clumsy in other ways, but I went for maximum readability here.

Contacts2 <- left_join(Contacts2, TableOfTitle, by = "Title") %>%
             transmute(First = First,
                       Last = Last,
                       Title = Title,
                       Finance = ifelse(is.na(Finance.x), Finance.y, Finance.x),
                       Executive = ifelse(is.na(Executive.x), Executive.y, Executive.x),
                       Communications = ifelse(is.na(Communications.x), Communications.y, Communications.x))

Example output:

First        Last                     Title Finance Executive Communications
George Washington                       CEO    <NA>        1           <NA>
Thomas  Jefferson                Accountant      1       <NA>          <NA>
James     Madison Communications Specialist    <NA>      <NA>            1
Jimmy      Carter                 President      1       <NA>          <NA>
Howard       Taft                Accountant      1       <NA>          <NA>
Herbert    Hoover                       CFO      1         1           <NA>

Upvotes: 1

Related Questions