Reputation: 23
I am relatively new to R and absolutely new to Stack Overflow (though I have researched a lot here; my prior experience is in Stata, Excel, VBA and a little C).
I have an R data frame df1 that looks like the following example, just with a few thousand rows:
ID Date Value Class ZIP
TRA0001 2007-09-25 150 1 75019
TRA0002 2002-08-09 200 2 30152
TRA0003 2010-08-31 500 3 12451
TRA0004 2005-06-17 75 1 45242
TRA0005 2010-08-26 410 3 14618
TRA0006 2008-07-07 155 1 70139
TRA0007 2010-01-15 450 3 12883
TRA0008 2000-11-03 80 4 45242
TRA0009 2003-05-01 120 2 63017
TRA0010 2000-10-01 85 5 23712
Each row stands for one transaction. For each transaction, I need to find similar transactions based on the following combination of matching criteria (AND-connected):
- same Class,
- Date within +/- 18 months, and
- Value within +/- 20%.
Note that there can be no match, one match or multiple matches for each transaction/row. What I need in the end is a list of matches with respect to the combination of the three criteria above.
For the given example, a result df2 would look like this:
ID ID_Match ZIP_Match
TRA0001 TRA0006 70139
TRA0003 TRA0005 14618
TRA0003 TRA0007 12883
TRA0005 TRA0007 12883
TRA0006 TRA0001 75019
TRA0007 TRA0003 12451
TRA0007 TRA0005 14618
So far, I have tried various duplicate searches to get closer to my desired outcome by fulfilling at least one matching criterion and then "filtering down" the result according to the other constraints. I started with the Class condition, as this seemed to me to be the easiest criterion (and probably also the most selective). All I came up with in the end was, e.g., a list of all classes that have duplicates and their respective index positions. For that I used the following code (found on Stack Overflow, credits to user "eddi"):
dups = duplicated(df1$Class) | duplicated(df1$Class, fromLast = T)
split(which(dups), df1$Class[dups])
However, this still leaves me miles away from my desired result, and I have no idea how to "integrate" the other conditions. I hope I have provided all the necessary information and made my problem clear. Any hints, suggestions or solutions are more than welcome! Thanks in advance!
Additionally: if someone comes up with an idea for how to do this in Stata, that would also be welcome - I have slightly more knowledge of Stata than of R.
Upvotes: 2
Views: 1423
Reputation: 1051
There's a new user-written program called rangejoin (from SSC) that can be used to easily solve this problem in Stata. In order to use rangejoin, you also have to install rangestat (also from SSC). To install both, type in Stata's command window:
ssc install rangestat
ssc install rangejoin
rangejoin
forms all pairwise combinations of observations that fall within a specified range. Since you want to match observations that have the same Class value, the join can be performed within Class groups. Since you have daily dates, I set up the solution to use a window of +/- 548 days (based on 365.25 days a year). Once all pairwise combinations are formed (within the specified time window for each observation), you can drop those that do not match your 20% threshold for Value.
Here's a fully functional example that uses your posted data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 ID str10 Date int Value byte Class str5 ZIP
"TRA0001" "2007-09-25" 150 1 "75019"
"TRA0002" "2002-08-09" 200 2 "30152"
"TRA0003" "2010-08-31" 500 3 "12451"
"TRA0004" "2005-06-17" 75 1 "45242"
"TRA0005" "2010-08-26" 410 3 "14618"
"TRA0006" "2008-07-07" 155 1 "70139"
"TRA0007" "2010-01-15" 450 3 "12883"
"TRA0008" "2000-11-03" 80 4 "45242"
"TRA0009" "2003-05-01" 120 2 "63017"
"TRA0010" "2000-10-01" 85 5 "23712"
end
* convert string date to Stata numeric date
gen ndate = daily(Date, "YMD")
format %td ndate
* save a copy of the dataset to disk
save "using_copy.dta", replace
* match, within the same Class, obs within +/- 18 months (365.25 * 1.5 =~ 548 days)
rangejoin ndate -548 548 using "using_copy.dta", by(Class) suffix(_Match)
* drop matches where Value is off by more than 20%, and self-matches
drop if (abs(Value - Value_Match) / Value) > .2
drop if ID == ID_Match
* final results
sort ID ID_Match
list ID ID_Match ZIP_Match, sepby(ID) noobs
And the results:
. list ID ID_Match ZIP_Match, sepby(ID) noobs
+-------------------------------+
| ID ID_Match ZIP_Ma~h |
|-------------------------------|
| TRA0001 TRA0006 70139 |
|-------------------------------|
| TRA0003 TRA0005 14618 |
| TRA0003 TRA0007 12883 |
|-------------------------------|
| TRA0005 TRA0007 12883 |
|-------------------------------|
| TRA0006 TRA0001 75019 |
|-------------------------------|
| TRA0007 TRA0003 12451 |
| TRA0007 TRA0005 14618 |
+-------------------------------+
Upvotes: 1
Reputation: 10761
I think I found a way you can do it. Basically, we define a function that does what you want for one ID, use sapply to iterate through all the IDs, and then use a call to rbind to put the results together.
The number-of-months function comes from @Dirk, in this post.
df <- read.table(text =
"ID Date Value Class ZIP
TRA0001 2007-09-25 150 1 75019
TRA0002 2002-08-09 200 2 30152
TRA0003 2010-08-31 500 3 12451
TRA0004 2005-06-17 75 1 45242
TRA0005 2010-08-26 410 3 14618
TRA0006 2008-07-07 155 1 70139
TRA0007 2010-01-15 450 3 12883
TRA0008 2000-11-03 80 4 45242
TRA0009 2003-05-01 120 2 63017
TRA0010 2000-10-01 85 5 23712",
header = T)
# turn a date into a 'monthnumber' relative to an origin
monnb <- function(d) {
lt <- as.POSIXlt(as.Date(d, origin="1900-01-01"))
lt$year*12 + lt$mon
}
# compute a month difference as a difference between two monnb's
mondf <- function(d1, d2) { monnb(d2) - monnb(d1) }
find_fxn <- function(data, origID){
#create subset with ID of interest
orig_data <- subset(data, ID == origID)
#subset of all other IDs
other_data <- subset(data, ID != origID)
#three matching criteria
find_first <- which(abs(mondf(orig_data$Date, other_data$Date)) <= 18)
find_second <- which(other_data$Value >= 0.8 * orig_data$Value & other_data$Value <= 1.2 * orig_data$Value)
find_third <- which(other_data$Class == orig_data$Class)
#use intersect to remove dups
find_all <- intersect(intersect(find_first, find_second), find_third)
if(length(find_all) > 0){
cbind.data.frame(ID = orig_data$ID,
IDMatch = other_data[find_all, 1],
ZipMatch = other_data[find_all, 5])
}
}
do.call('rbind', sapply(df$ID, FUN = function(x) find_fxn(data = df, origID = x)))
ID IDMatch ZipMatch
1 TRA0001 TRA0006 70139
2 TRA0003 TRA0005 14618
3 TRA0003 TRA0007 12883
4 TRA0005 TRA0007 12883
5 TRA0006 TRA0001 75019
6 TRA0007 TRA0003 12451
7 TRA0007 TRA0005 14618
Upvotes: 2
Reputation: 109
First of all, use the data.table package.
Then you can write a simple function that looks for all transactions similar to a given one.
At the end, loop over your dataset to get all the similar sets:
dt1 <- data.table::fread('ID Date Value Class ZIP
TRA0001 2007-09-25 150 1 75019
TRA0002 2002-08-09 200 2 30152
TRA0003 2010-08-31 500 3 12451
TRA0004 2005-06-17 75 1 45242
TRA0005 2010-08-26 410 3 14618
TRA0006 2008-07-07 155 1 70139
TRA0007 2010-01-15 450 3 12883
TRA0008 2000-11-03 80 4 45242
TRA0009 2003-05-01 120 2 63017
TRA0010 2000-10-01 85 5 23712')
dt1[, Date:=as.POSIXct(Date)]
myTransaction <- dt1[1]
dt1[Class==myTransaction$Class & abs(difftime(Date, myTransaction$Date, units='weeks')) < 4*18 & abs((Value-myTransaction$Value)/myTransaction$Value) < .2]
similar <- lapply(1:nrow(dt1), function(x)
{
myTransaction <- dt1[x]
dt1[ID!=myTransaction$ID & Class==myTransaction$Class & abs(difftime(Date, myTransaction$Date, units='weeks')) < 4*18 & abs((Value-myTransaction$Value)/myTransaction$Value) < .2]
})
names(similar) <- dt1$ID
Use similar[['TRA0006']] to check for transactions similar to that one.
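If you want the combined df2-style table from the question rather than a per-ID list, one possible follow-up (a sketch, assuming the similar list built above; the column name ID_orig is my own choice) is to stack the list with data.table::rbindlist, whose idcol argument records which original transaction each match belongs to:

```r
library(data.table)

# Stack the per-ID list into one table; idcol stores each list
# element's name (the original transaction's ID) in a new column.
matches <- rbindlist(similar, idcol = "ID_orig")

# Keep only the columns of the desired df2 layout:
# the original ID, the matched ID and the matched ZIP.
df2 <- matches[, .(ID = ID_orig, ID_Match = ID, ZIP_Match = ZIP)]
```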
Upvotes: 0