Reputation: 2454
I have data frame df1
with columns "Year"
and "Agent"
:
df1 <- structure(list(Year = c(1999, 1999, 1998), Agent = list(c("abn-amro-nv", "suntrust banks", "wachovia"), c("jp morgan", "abn-amro-nv"), c("ba-corp", "boston bks", "nbd"))), .Names = c("Year", "Agent"), row.names = c(NA, -3L), class = "data.frame")
#df1
Year Agent
1999 abn-amro-nv, suntrust banks, wachovia
1999 jp morgan, abn-amro-nv
1998 ba-corp, boston bks, nbd
I have another data frame df2
which has 5 columns "Rank"
, "Arrangers"
, "Share"
, "Issues"
and "Year"
, as following:
df2 <- structure(list(Rank = 1:3, Arranger = c("jp morgan", "boston-bank", "suntrust bk"), Share = c(1.2, 1.8, 2.1), Issues = c(7L, 4L, 3L), Year = c(1999L, 1998L, 1999L)), .Names = c("Rank", "Arranger", "Share", "Issues", "Year"), class = "data.frame", row.names = c(NA, -3L))
#df2
Rank Arranger Share Issues Year
1 jp morgan 1.2 7 1999
2 boston-bank 1.8 4 1998
3 suntrust bk 2.1 3 1999
I need to match "Agent"
and "Year"
of df1
with "Arranger"
and "Year"
of df2
and select 3 columns from df2
such as "Rank"
, "Share"
, "Issues"
.
Matching of "Agent"
of df1
with "Arranger"
of df2
will be fuzzy matching. It's because they are not exactly same.
My original data frames are very large just for your information.
Following is my code:
library(stringdist)
leadrep <- matrix(ncol=3, nrow=length(df1$Agent))
for (i in 1:length(df1$Agent)) {
for (j in 1:length(df2$Arrangers)) {
if ((ain(df2$Arrangers[j], df1$Agent[[i]], maxDist=0.3,
method="jw")) == 'TRUE' & (df1$Year[i] == df2$Year[j])){
leadrep[i,] <- df2[j, c('Rank', 'Mkt.Share', 'NumberofIssues')]
}
}
}
In the code, leadrep
is matrix which I created. So that I can populate in the for loop
. I used stringdist
package and ain
function to do fuzzy matching. And I used two for loop
and if
to compare the strings and year.
The above code works but it's taking too long time to get the results since my data frames are large. I really think that my approach above is not an efficient. It would be a great help if someone provides me a better alternative to my existing code.
Thank you for all your help.
Upvotes: 1
Views: 783
Reputation: 7445
The following approach uses foreach
instead of your two nested for
loops, which should make your computation much faster on large data frames. See this for a nice overview of the package. You should also look at the vignettes.
library(foreach)
library(stringdist)
match.cond <- function(ij, df1, df2) { ## 1.
i = floor((ij-1) / nrow(df2)) + 1
j = ij - (i-1) * nrow(df2)
if ((ain(df2$Arranger[j], df1$Agent[[i]], maxDist=0.3,
method="jw")) == 'TRUE' & (df1$Year[i] == df2$Year[j])){
return(df2[j, c('Rank', 'Share', 'Issues')])
}
}
leadrep <- foreach(ij = 1:(nrow(df1)*nrow(df2)), .combine=rbind) %do% ## 2.
match.cond(ij, df1, df2)
Notes:
match.cond
is your match condition encapsulated into a kernel function that will be vectorized by foreach
over all pairings of rows from df1
and df2
. Its inputs are ij
, which is an index to the pairings, and the two data frames. Within match.cond
:
ij
is converted to the row indices i
for df1
and j
for df2
df2
for the matched row is returned.foreach
call.
ij
from 1
to nrow(df1)*nrow(df2)
, which enumerates all pairings of rows from df1
and df2
and %do%
the match.cond
function. Note that this is all one line..combine=rbind
argument states that we want to gather all results from match.cond
and bind them as rows.leadrep
I've tested on your data, which I've dput
:
df1 <- structure(list(Year = c(1999, 1999, 1998), Agent = list(c("abn-amro-nv",
"suntrust banks", "wachovia"), c("jp morgan", "abn-amro-nv"),
c("ba-corp", "boston bks", "nbd"))), .Names = c("Year", "Agent"
), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(Rank = 1:3, Arranger = c("jp morgan", "boston-bank",
"suntrust bk"), Share = c(1.2, 1.8, 2.1), Issues = c(7L, 4L,
3L), Year = c(1999L, 1998L, 1999L)), .Names = c("Rank", "Arranger",
"Share", "Issues", "Year"), class = "data.frame", row.names = c(NA,
-3L))
which gives me the desired result:
print(leadrep)
## Rank Share Issues
##3 3 2.1 3
##2 1 1.2 7
##21 2 1.8 4
Hope this helps.
Upvotes: 1