aguadamuz
aguadamuz

Reputation: 421

Comparing data frames in R

I am really new to R and stackoverflow; apologies in advance for issues with my question.

I have two data frames

data.frame 1:

Product.ID Description Wholesale.Price
Prod1      Desc1       1.45
Prod       Desc2       1.27
Prod3      Desc        3.62
Prod4      Desc4       2.15
Prod5      Desc5       2.87
Prod12     Desc6       2.53
Prod7      Desc7       2.20
Prod8      Desc8       2.60
Prod9      Desc9       3.68

data.frame 2:

Product.ID Description Wholesale.Price
Prod1      Desc1       1.45
Prod2      Desc2       1.27
Prod3      Desc3       3.62
Prod4      Desc4       1.57
Prod5      Desc5       2.87
Prod6      Desc6       2.53
Prod7      Desc7       2.20
Prod8      Desc8       3.21
Prod9      Desc9       1.81

I see that I can use merge(list_1, list_2) to print where all 3 columns of the two data frames match (which is very cool).

I am trying to find a means of printing out where there are discrepancies between Description and Wholesale.price between the two data frames based on Product.ID. I am not even sure how to visualize the discrepancies in a meaningful way.

Any assistance is most appreciated.

Upvotes: 3

Views: 234

Answers (3)

bgoldst
bgoldst

Reputation: 35314

I just wrote a function for someone the other day to perform this exact task. With a couple of modifications, it can be used here:

df1 <- data.frame(Product.ID=c('Prod1','Prod','Prod3','Prod4','Prod5','Prod12','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,2.15,2.87,2.53,2.20,2.60,3.68), stringsAsFactors=F );
df2 <- data.frame(Product.ID=c('Prod1','Prod2','Prod3','Prod4','Prod5','Prod6','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc3','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,1.57,2.87,2.53,2.20,3.21,1.81), stringsAsFactors=F );
df1;
##   Product.ID Description Wholesale.Price
## 1      Prod1       Desc1            1.45
## 2       Prod       Desc2            1.27
## 3      Prod3        Desc            3.62
## 4      Prod4       Desc4            2.15
## 5      Prod5       Desc5            2.87
## 6     Prod12       Desc6            2.53
## 7      Prod7       Desc7            2.20
## 8      Prod8       Desc8            2.60
## 9      Prod9       Desc9            3.68
df2;
##   Product.ID Description Wholesale.Price
## 1      Prod1       Desc1            1.45
## 2      Prod2       Desc2            1.27
## 3      Prod3       Desc3            3.62
## 4      Prod4       Desc4            1.57
## 5      Prod5       Desc5            2.87
## 6      Prod6       Desc6            2.53
## 7      Prod7       Desc7            2.20
## 8      Prod8       Desc8            3.21
## 9      Prod9       Desc9            1.81
compare <- function(d1,d2,idcol='id',cols=setdiff(intersect(colnames(d1),colnames(d2)),idcol)) {
    com <- intersect(d1[[idcol]],d2[[idcol]]);
    d1com <- match(com,d1[[idcol]]);
    d2com <- match(com,d2[[idcol]]);
    setNames(lapply(cols,function(col) com[d1[[col]][d1com]!=d2[[col]][d2com]]),cols);
}; cmp <- compare(df1,df2,'Product.ID'); cmp;
## $Description
## [1] "Prod3"
##
## $Wholesale.Price
## [1] "Prod4" "Prod8" "Prod9"

cmp now contains a vector of Product.IDs that differ between the two data.frames, one vector for each non-key column. You can display the actual differences by subsetting on those vectors and merging the results:

merge(subset(df1,Product.ID%in%cmp$Description),subset(df2,Product.ID%in%cmp$Description),by='Product.ID');
##   Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
## 1      Prod3          Desc              3.62         Desc3              3.62
merge(subset(df1,Product.ID%in%cmp$Wholesale.Price),subset(df2,Product.ID%in%cmp$Wholesale.Price),by='Product.ID');
##   Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
## 1      Prod4         Desc4              2.15         Desc4              1.57
## 2      Prod8         Desc8              2.60         Desc8              3.21
## 3      Prod9         Desc9              3.68         Desc9              1.81

An advantage of this solution is that it avoids merging the entire contents of the input data.frames prior to calculating the discrepancies. Such a merge is unnecessary and wasteful of CPU and memory, which could be significant for large inputs.

Upvotes: 2

jeremycg
jeremycg

Reputation: 24945

Here is a quick two liner. First read in the data from @bgoldst:

df1 <- data.frame(Product.ID=c('Prod1','Prod','Prod3','Prod4','Prod5','Prod12','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,2.15,2.87,2.53,2.20,2.60,3.68), stringsAsFactors=F );
df2 <- data.frame(Product.ID=c('Prod1','Prod2','Prod3','Prod4','Prod5','Prod6','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc3','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,1.57,2.87,2.53,2.20,3.21,1.81), stringsAsFactors=F );

Now we want to merge it, but keep all the columns:

x <- merge(df1, df2, by = "Product.ID")

Now print out the columns with those that have a mismatch in price or description:

x[x$Description.x != x$Description.y | x$Wholesale.Price.x != x$Wholesale.Price.y, ]


  Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
2      Prod3          Desc              3.62         Desc3              3.62
3      Prod4         Desc4              2.15         Desc4              1.57
6      Prod8         Desc8              2.60         Desc8              3.21
7      Prod9         Desc9              3.68         Desc9              1.81

Upvotes: 6

Gregor Thomas
Gregor Thomas

Reputation: 145755

Let's rename the columns you want to compare:

names(list_1)[3] = "Price1"
names(list_2)[3] = "Price2"

Now we can merge and keep both price columns.

list_both = merge(list_1, list_2)

# calculate differences
list_both$difference = list_both$Price1 - list_both$Price2

# look at the top of the data
head(list_both)

# print out those with a difference
list_both[list_both$difference != 0, ]

For visualization, I'll leave you to explore a little on your own from here.

Upvotes: 3

Related Questions