Santosh M.
Santosh M.

Reputation: 2454

Selecting columns based on matching/fuzzy matching value of two columns

I have data frame df1 with columns "Year" and "Agent":

df1 <- structure(list(Year = c(1999, 1999, 1998), Agent = list(c("abn-amro-nv", "suntrust banks", "wachovia"), c("jp morgan", "abn-amro-nv"), c("ba-corp", "boston bks", "nbd"))), .Names = c("Year", "Agent"), row.names = c(NA, -3L), class = "data.frame")

#df1
Year                                   Agent
1999   abn-amro-nv, suntrust banks, wachovia
1999                  jp morgan, abn-amro-nv
1998                ba-corp, boston bks, nbd

I have another data frame df2 which has 5 columns "Rank", "Arrangers", "Share", "Issues" and "Year", as following:

df2 <- structure(list(Rank = 1:3, Arranger = c("jp morgan", "boston-bank", "suntrust bk"), Share = c(1.2, 1.8, 2.1), Issues = c(7L, 4L, 3L), Year = c(1999L, 1998L, 1999L)), .Names = c("Rank", "Arranger", "Share", "Issues", "Year"), class = "data.frame", row.names = c(NA, -3L))

#df2
Rank    Arranger         Share    Issues    Year
 1      jp morgan        1.2       7        1999
 2      boston-bank      1.8       4        1998
 3      suntrust bk      2.1       3        1999

I need to match "Agent" and "Year" of df1 with "Arranger" and "Year" of df2 and select 3 columns from df2 such as "Rank", "Share", "Issues". Matching of "Agent" of df1 with "Arranger" of df2 will be fuzzy matching. It's because they are not exactly same.

My original data frames are very large just for your information.

Following is my code:

library(stringdist)

leadrep <- matrix(ncol=3, nrow=length(df1$Agent))

for (i in 1:length(df1$Agent)) {

  for (j in 1:length(df2$Arrangers)) {

      if ((ain(df2$Arrangers[j], df1$Agent[[i]], maxDist=0.3, 
      method="jw")) == 'TRUE' & (df1$Year[i] == df2$Year[j])){             

         leadrep[i,] <-  df2[j, c('Rank', 'Mkt.Share', 'NumberofIssues')]

      }

  }
} 

In the code, leadrep is matrix which I created. So that I can populate in the for loop. I used stringdist package and ain function to do fuzzy matching. And I used two for loop and if to compare the strings and year.

The above code works but it's taking too long time to get the results since my data frames are large. I really think that my approach above is not an efficient. It would be a great help if someone provides me a better alternative to my existing code.

Thank you for all your help.

Upvotes: 1

Views: 783

Answers (1)

aichao
aichao

Reputation: 7445

The following approach uses foreach instead of your two nested for loops, which should make your computation much faster on large data frames. See this for a nice overview of the package. You should also look at the vignettes.

library(foreach)
library(stringdist)

match.cond <- function(ij, df1, df2) {                                   ## 1.
  i = floor((ij-1) / nrow(df2)) + 1
  j = ij - (i-1) * nrow(df2)
  if ((ain(df2$Arranger[j], df1$Agent[[i]], maxDist=0.3, 
       method="jw")) == 'TRUE' & (df1$Year[i] == df2$Year[j])){
    return(df2[j, c('Rank', 'Share', 'Issues')])
  }
}

leadrep <- foreach(ij = 1:(nrow(df1)*nrow(df2)), .combine=rbind) %do%    ## 2.
  match.cond(ij, df1, df2)

Notes:

  1. match.cond is your match condition encapsulated into a kernel function that will be vectorized by foreach over all pairings of rows from df1 and df2. Its inputs are ij, which is an index to the pairings, and the two data frames. Within match.cond:
    • ij is converted to the row indices i for df1 and j for df2
    • Your condition is evaluated, and if the condition is met,
    • The columns from df2 for the matched row is returned.
  2. This is the foreach call.
    • We loop over the indices ij from 1 to nrow(df1)*nrow(df2), which enumerates all pairings of rows from df1 and df2 and %do% the match.cond function. Note that this is all one line.
    • The .combine=rbind argument states that we want to gather all results from match.cond and bind them as rows.
    • This returns the data frame leadrep

I've tested on your data, which I've dput:

df1 <- structure(list(Year = c(1999, 1999, 1998), Agent = list(c("abn-amro-nv", 
"suntrust banks", "wachovia"), c("jp morgan", "abn-amro-nv"), 
c("ba-corp", "boston bks", "nbd"))), .Names = c("Year", "Agent"
), row.names = c(NA, -3L), class = "data.frame")

df2 <- structure(list(Rank = 1:3, Arranger = c("jp morgan", "boston-bank", 
"suntrust bk"), Share = c(1.2, 1.8, 2.1), Issues = c(7L, 4L, 
3L), Year = c(1999L, 1998L, 1999L)), .Names = c("Rank", "Arranger", 
"Share", "Issues", "Year"), class = "data.frame", row.names = c(NA, 
-3L))

which gives me the desired result:

print(leadrep)
##   Rank Share Issues
##3     3   2.1      3
##2     1   1.2      7
##21    2   1.8      4

Hope this helps.

Upvotes: 1

Related Questions