Lighty

Reputation: 23

R: Finding similar/"duplicate" PAIRS of entries based on multiple variables/columns/constraints

I am relatively new to R and absolutely new to Stack Overflow (though I have already researched a lot here; I have some prior experience in Stata, Excel, VBA and a little C).

I have an R data frame df1 that looks like the following example, just with a few thousand rows:

ID       Date         Value  Class  ZIP 
TRA0001  2007-09-25   150    1      75019
TRA0002  2002-08-09   200    2      30152
TRA0003  2010-08-31   500    3      12451
TRA0004  2005-06-17   75     1      45242
TRA0005  2010-08-26   410    3      14618
TRA0006  2008-07-07   155    1      70139
TRA0007  2010-01-15   450    3      12883
TRA0008  2000-11-03   80     4      45242
TRA0009  2003-05-01   120    2      63017
TRA0010  2000-10-01   85     5      23712

Each row stands for one transaction. For each transaction, I need to find similar transactions based on the following combination of "matching criteria" (AND-connected):

  1. Date must be within +/- 18 months, e.g. for TRA0001 the only match would be TRA0006
  2. Value must be within +/- 20% of the original row's value, e.g. matches for TRA0001 would be TRA0006 and TRA0009
  3. Class must be an exact match, e.g. the matches for TRA0001 upon this criterion would be TRA0004 and TRA0006

Note that each transaction/row can have no match, one match or multiple matches. What I need in the end is a list of matches with respect to the combination of the three criteria mentioned above.

For the given example, a result df2 would look like this:

ID       ID_Match   ZIP_Match
TRA0001  TRA0006    70139
TRA0003  TRA0005    14618
TRA0003  TRA0007    12883
TRA0005  TRA0007    12883
TRA0006  TRA0001    75019
TRA0007  TRA0003    12451
TRA0007  TRA0005    14618
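For concreteness, the three criteria for a candidate pair of rows (i, j) boil down to a single boolean test. A minimal sketch in R (is_match is a hypothetical helper, df$Date is assumed to have been converted with as.Date, and 548 days stands in for +/- 18 months):

```r
# Sketch only: the three AND-connected criteria for one candidate pair (i, j).
# is_match is a hypothetical name; 548 days approximates 18 months.
is_match <- function(i, j, df) {
  within_date  <- abs(as.numeric(difftime(df$Date[i], df$Date[j],
                                          units = "days"))) <= 548
  within_value <- abs(df$Value[j] - df$Value[i]) / df$Value[i] <= 0.2
  same_class   <- df$Class[i] == df$Class[j]
  within_date && within_value && same_class
}
```

A full solution then amounts to applying this test to every ordered pair of rows and keeping the matches.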

So far, I have tried various combinations of duplicate searches to get closer to my desired outcome by fulfilling at least one matching criterion and then "filtering down" that result according to the other constraints. I started with the Class condition, as it seemed to me the easiest criterion (and probably also the most selective). All I came up with in the end was, e.g., a list of all classes that have duplicates and their respective index positions where the duplicates can be found. For that I used the following code (found on Stack Overflow, credits to user "eddi"):

dups = duplicated(df1$Class) | duplicated(df1$Class, fromLast = TRUE)
split(which(dups), df1$Class[dups])

However, this still leaves me miles away from my desired result, and I have no idea how to "integrate" the other conditions. I hope I have provided all the necessary information and made my problem clear. Any hints, suggestions or solutions are more than welcome! Thanks in advance!

Additionally: if someone comes up with an idea of how to do the required work in Stata, that would also be welcome - I have slightly more knowledge of Stata than of R.

Upvotes: 2

Views: 1423

Answers (3)

Robert Picard

Reputation: 1051

There's a new user-written program called rangejoin (from SSC) that can be used to easily solve this problem in Stata. In order to use rangejoin, you also have to install rangestat (also from SSC). To install both, type in Stata's command window:

ssc install rangestat
ssc install rangejoin

rangejoin forms all pairwise combinations of observations that fall within a specified range. Since you want to match observations that have the same Class value, the join can be performed within Class groups. Since you have daily dates, I set up the solution to use a window of +/- 548 days (based on 365.25 days a year). Once all pairwise combinations are formed (within the specified time window for each observation), you can drop those that do not match your 20% threshold for Value.

Here's a fully functional example that uses your posted data:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 ID str10 Date int Value byte Class str5 ZIP
"TRA0001" "2007-09-25" 150 1 "75019"
"TRA0002" "2002-08-09" 200 2 "30152"
"TRA0003" "2010-08-31" 500 3 "12451"
"TRA0004" "2005-06-17"  75 1 "45242"
"TRA0005" "2010-08-26" 410 3 "14618"
"TRA0006" "2008-07-07" 155 1 "70139"
"TRA0007" "2010-01-15" 450 3 "12883"
"TRA0008" "2000-11-03"  80 4 "45242"
"TRA0009" "2003-05-01" 120 2 "63017"
"TRA0010" "2000-10-01"  85 5 "23712"
end

* convert string date to Stata numeric date
gen ndate = daily(Date, "YMD")
format %td ndate

* save a copy of the dataset to disk
save "using_copy.dta", replace

* match, within the same Class, obs +/- 18 months (365.25 * 1.5 =~ 548 days)
rangejoin ndate -548 548 using "using_copy.dta", by(Class) suffix(_Match)

* drop matches where Value is off by more than 20%, and self-matches
drop if (abs(Value - Value_Match) / Value) > .2
drop if ID == ID_Match

* final results
sort ID ID_Match
list ID ID_Match ZIP_Match, sepby(ID) noobs

And the results:

. list ID ID_Match ZIP_Match, sepby(ID) noobs

  +-------------------------------+
  |      ID   ID_Match   ZIP_Ma~h |
  |-------------------------------|
  | TRA0001    TRA0006      70139 |
  |-------------------------------|
  | TRA0003    TRA0005      14618 |
  | TRA0003    TRA0007      12883 |
  |-------------------------------|
  | TRA0005    TRA0007      12883 |
  |-------------------------------|
  | TRA0006    TRA0001      75019 |
  |-------------------------------|
  | TRA0007    TRA0003      12451 |
  | TRA0007    TRA0005      14618 |
  +-------------------------------+

Upvotes: 1

bouncyball

Reputation: 10761

I think I found a way you can do it. Basically, we define a function that does what you want for one ID, then use sapply to iterate over all the IDs, and finally call rbind to put the results together.

The number-of-months function comes from @Dirk in this post.

df <- read.table(text = 
          "ID       Date         Value  Class  ZIP 
           TRA0001  2007-09-25   150    1      75019
           TRA0002  2002-08-09   200    2      30152
           TRA0003  2010-08-31   500    3      12451
           TRA0004  2005-06-17   75     1      45242
           TRA0005  2010-08-26   410    3      14618
           TRA0006  2008-07-07   155    1      70139
           TRA0007  2010-01-15   450    3      12883
           TRA0008  2000-11-03   80     4      45242
           TRA0009  2003-05-01   120    2      63017
           TRA0010  2000-10-01   85     5      23712", 
           header = T)
# turn a date into a 'monthnumber' relative to an origin
monnb <- function(d) { 
      lt <- as.POSIXlt(as.Date(d, origin="1900-01-01"))
      lt$year*12 + lt$mon
    } 

# compute a month difference as a difference between two monnb's
mondf <- function(d1, d2) { monnb(d2) - monnb(d1) }

find_fxn <- function(data, origID){
  #create subset with ID of interest
  orig_data <- subset(data, ID == origID)
  #subset of all other IDs
  other_data <- subset(data, ID != origID)
  #three matching criteria
  find_first <- which(abs(mondf(orig_data$Date, other_data$Date)) <= 18)
  find_second <- which(other_data$Value >= 0.8 * orig_data$Value & other_data$Value <= 1.2 * orig_data$Value)
  find_third <- which(other_data$Class == orig_data$Class)
  #use intersect to remove dups
  find_all <- intersect(intersect(find_first, find_second), find_third)
  if(length(find_all) > 0){
  cbind.data.frame(ID = orig_data$ID, 
                   IDMatch = other_data[find_all, 1],
                   ZipMatch = other_data[find_all, 5])
  }
}

do.call('rbind', sapply(df$ID, FUN = function(x) find_fxn(data = df, origID = x)))

       ID IDMatch ZipMatch
1 TRA0001 TRA0006    70139
2 TRA0003 TRA0005    14618
3 TRA0003 TRA0007    12883
4 TRA0005 TRA0007    12883
5 TRA0006 TRA0001    75019
6 TRA0007 TRA0003    12451
7 TRA0007 TRA0005    14618

Upvotes: 2

Abderyt

Reputation: 109

First of all, use the data.table package.

Then you can write a simple function that looks for all transactions similar to the provided one.

In the end, loop over your dataset to get all similar sets:

dt1 <- data.table::fread('ID       Date         Value  Class  ZIP 
TRA0001  2007-09-25   150    1      75019
TRA0002  2002-08-09   200    2      30152
TRA0003  2010-08-31   500    3      12451
TRA0004  2005-06-17   75     1      45242
TRA0005  2010-08-26   410    3      14618
TRA0006  2008-07-07   155    1      70139
TRA0007  2010-01-15   450    3      12883
TRA0008  2000-11-03   80     4      45242
TRA0009  2003-05-01   120    2      63017
TRA0010  2000-10-01   85     5      23712')

dt1[, Date:=as.POSIXct(Date)]
myTransaction <- dt1[1]
dt1[Class==myTransaction$Class & abs(difftime(Date, myTransaction$Date, units='weeks')) < 4*18 & abs((Value-myTransaction$Value)/myTransaction$Value) < .2]

similar <- lapply(1:nrow(dt1), function(x)
  {
  myTransaction <- dt1[x]
  dt1[ID!=myTransaction$ID & Class==myTransaction$Class & abs(difftime(Date, myTransaction$Date, units='weeks')) < 4*18 & abs((Value-myTransaction$Value)/myTransaction$Value) < .2]
})
names(similar) <- dt1$ID

Use similar[['TRA0006']] to check for similar transactions.

Upvotes: 0
