Reputation: 421
I am really new to R and stackoverflow; apologies in advance for issues with my question.
I have two data frames
data.frame 1:
Product.ID Description Wholesale.Price
Prod1 Desc1 1.45
Prod Desc2 1.27
Prod3 Desc 3.62
Prod4 Desc4 2.15
Prod5 Desc5 2.87
Prod12 Desc6 2.53
Prod7 Desc7 2.20
Prod8 Desc8 2.60
Prod9 Desc9 3.68
data.frame 2:
Product.ID Description Wholesale.Price
Prod1 Desc1 1.45
Prod2 Desc2 1.27
Prod3 Desc3 3.62
Prod4 Desc4 1.57
Prod5 Desc5 2.87
Prod6 Desc6 2.53
Prod7 Desc7 2.20
Prod8 Desc8 3.21
Prod9 Desc9 1.81
I see that I can use merge(list_1, list_2) to print where all 3 columns of the two data frames match (which is very cool).
I am trying to find a means of printing out where there are discrepancies between Description and Wholesale.price between the two data frames based on Product.ID. I am not even sure how to visualize the discrepancies in a meaningful way.
Any assistance is most appreciated.
Upvotes: 3
Views: 234
Reputation: 35314
I just wrote a function for someone the other day to perform this exact task. With a couple of modifications, it can be used here:
df1 <- data.frame(Product.ID=c('Prod1','Prod','Prod3','Prod4','Prod5','Prod12','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,2.15,2.87,2.53,2.20,2.60,3.68), stringsAsFactors=F );
df2 <- data.frame(Product.ID=c('Prod1','Prod2','Prod3','Prod4','Prod5','Prod6','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc3','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,1.57,2.87,2.53,2.20,3.21,1.81), stringsAsFactors=F );
df1;
## Product.ID Description Wholesale.Price
## 1 Prod1 Desc1 1.45
## 2 Prod Desc2 1.27
## 3 Prod3 Desc 3.62
## 4 Prod4 Desc4 2.15
## 5 Prod5 Desc5 2.87
## 6 Prod12 Desc6 2.53
## 7 Prod7 Desc7 2.20
## 8 Prod8 Desc8 2.60
## 9 Prod9 Desc9 3.68
df2;
## Product.ID Description Wholesale.Price
## 1 Prod1 Desc1 1.45
## 2 Prod2 Desc2 1.27
## 3 Prod3 Desc3 3.62
## 4 Prod4 Desc4 1.57
## 5 Prod5 Desc5 2.87
## 6 Prod6 Desc6 2.53
## 7 Prod7 Desc7 2.20
## 8 Prod8 Desc8 3.21
## 9 Prod9 Desc9 1.81
compare <- function(d1,d2,idcol='id',cols=setdiff(intersect(colnames(d1),colnames(d2)),idcol)) {
com <- intersect(d1[[idcol]],d2[[idcol]]);
d1com <- match(com,d1[[idcol]]);
d2com <- match(com,d2[[idcol]]);
setNames(lapply(cols,function(col) com[d1[[col]][d1com]!=d2[[col]][d2com]]),cols);
}; cmp <- compare(df1,df2,'Product.ID'); cmp;
## $Description
## [1] "Prod3"
##
## $Wholesale.Price
## [1] "Prod4" "Prod8" "Prod9"
cmp
now contains a vector of Product.ID
s that differ between the two data.frames, one vector for each non-key column. You can display the actual differences by subsetting on those vectors and merging the results:
merge(subset(df1,Product.ID%in%cmp$Description),subset(df2,Product.ID%in%cmp$Description),by='Product.ID');
## Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
## 1 Prod3 Desc 3.62 Desc3 3.62
merge(subset(df1,Product.ID%in%cmp$Wholesale.Price),subset(df2,Product.ID%in%cmp$Wholesale.Price),by='Product.ID');
## Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
## 1 Prod4 Desc4 2.15 Desc4 1.57
## 2 Prod8 Desc8 2.60 Desc8 3.21
## 3 Prod9 Desc9 3.68 Desc9 1.81
An advantage of this solution is that it avoids merging the entire contents of the input data.frames prior to calculating the discrepancies. Such a merge is unnecessary and wasteful of CPU and memory, which could be significant for large inputs.
Upvotes: 2
Reputation: 24945
Here is a quick two liner. First read in the data from @bgoldst:
df1 <- data.frame(Product.ID=c('Prod1','Prod','Prod3','Prod4','Prod5','Prod12','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,2.15,2.87,2.53,2.20,2.60,3.68), stringsAsFactors=F );
df2 <- data.frame(Product.ID=c('Prod1','Prod2','Prod3','Prod4','Prod5','Prod6','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc3','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,1.57,2.87,2.53,2.20,3.21,1.81), stringsAsFactors=F );
Now we want to merge it, but keep all the columns:
x <- merge(df1, df2, by = "Product.ID")
Now print out the columns with those that have a mismatch in price or description:
x[x$Description.x != x$Description.y | x$Wholesale.Price.x != x$Wholesale.Price.y, ]
Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
2 Prod3 Desc 3.62 Desc3 3.62
3 Prod4 Desc4 2.15 Desc4 1.57
6 Prod8 Desc8 2.60 Desc8 3.21
7 Prod9 Desc9 3.68 Desc9 1.81
Upvotes: 6
Reputation: 145755
Let's rename the columns you want to compare:
names(list_1)[3] = "Price1"
names(list_2)[3] = "Price2"
Now we can merge and keep both price columns.
list_both = merge(list_1, list_2)
# calculate differences
list_both$difference = list_both$Price1 - list_both$Price2
# look at the top of the data
head(list_both)
# print out those with a difference
list_both[list_both$difference != 0, ]
For visualization, I'll leave you to explore a little on your own from here.
Upvotes: 3