Timothy Maness
Timothy Maness

Reputation: 13

Excel function to find duplicate rows in large table

I have a table that's multiple thousand rows long, and I'd like to find duplicate rows. I've used countifs(), and it's functional, but it's also very resource-intensive, and I'd like to find a way to do this in a less expensive way.

An example of the data I'd like to check is as follows

Date      Product Name     Quantity    Delivered to
10-1-15   Toilet Paper     1           Health Science
10-1-15   Paper Towel      1           Health Science
10-1-15   Can Liners       2           Health Science
10-2-15   Toilet Paper     2           Administration
10-2-15   Paper Towel      1           Health Science

The formula I've used is

=(countifs(A:A,A2,B:B,B2,C:C,C2,D:D,D2))

but this takes multiple minutes to evaluate with the amount of data I have. I'd like to set up a formula that only compares rows with the same date, so earlier rows don't have to recalculate, and later ones don't have to account for earlier additions.

The purpose behind this is to ensure that users of this sheet don't double-enter lines. I do a manual check for duplicates every week, as this is a tool for data mining our paper sign-out logs from a storage warehouse (we're aware it's an outdated system, but it's what we have to work with for now), but it'd make it easier to check my data if the sheet could alert me to possible duplications (especially without it taking 3 to 5 minutes to enter each line).

In addition, I'd prefer to use a formula, rather than VBA, as I have a hard enough time training people to use tools on a network drive without a scary warning about unsafe scripts.

Upvotes: 1

Views: 5046

Answers (3)

EEM
EEM

Reputation: 6659

Formula - Find duplicates

Assuming the data is located in the range A1:D1017 (change as required)

The solutions I propose include the use of two working fields as follows:

A. In E1 enter Key and in E2 this formula to concatenate all the fields to be validated, then copy the formula till the last record

=CONCATENATE( A2 , CHAR(133) , B2 , CHAR(133) , C2 , CHAR(133) , D2 ) 

CHAR(133) character use as separator, omit or change it as required.

B. Second field, called !Chk, used to flag the Duplicated records. However, as this solution includes three options, we’ll have for this purpose 3 fields Chk.1, Chk.2 and Chk.3 in cells F1, G1 and H1 respectively.

  1. Flag all Duplicated records (including original) Enter this formula in cell F2 and copy till last record

    =IF( COUNTIF( $E$1:$E$1017, $E2 ) = 1, "", "Duplicated" )
    
  2. Flag only Duplicated records, leaves unmarked the first record found.

    =IF( COUNTIF( $E$1:$E2, $E2 ) = 1, "", "Duplicated" )
    
  3. Flag only Duplicated records, flag the firs record found as Original.

    =IF( COUNTIF( $E$1:$E$1017, $E2 ) = 1, "",
    IF( COUNTIF( $E$1:$E2, $E2 ) = 1, "Original", "Duplicated" ) )
    

enter image description here

Hide column F if preferable

Upvotes: 1

Timothy Maness
Timothy Maness

Reputation: 13

I wrote a terrible bit of excel formula that does the job. It's nearly unreadable, but it doesn't recalculate based on the entire table anymore. I've entered this formula into a column to the right of my data entry, and it puts "!!" into the field if this row duplicates a previous one. It also checks that I've entered a complete row of data, to prevent it evaluating as I'm entering data.

=IF(
 COUNTA(A1017:D1017)=4,
 IF(
  COUNTIFS(
   INDIRECT(IF(ROW(<26,"A2:A"&ROW()+25,"A"&ROW()-25&":A"&ROW()+25)),$A1017,
   INDIRECT(IF(ROW(<26,"B2:B"&ROW()+25,"B"&ROW()-25&":B"&ROW()+25)),$B1017,
   INDIRECT(IF(ROW(<26,"C2:C"&ROW()+25,"C"&ROW()-25&":C"&ROW()+25)),$C1017,
   INDIRECT(IF(ROW(<26,"D2:D"&ROW()+25,"D"&ROW()-25&":D"&ROW()+25)),$D1017
  )
 >1,"!!","")
,"")

What it does is it checks the 25 rows above and below for duplicates, as 25 items is larger than the maximum number of entries added in any one day. The upshot is that this takes less than a second to indicate whether I've entered a duplicate line.

I'm going to try the solution posited by pnuts in the comments

You might get more speed if you check against a PivotTable filtered for the newly entered date, rather than against your raw data.

This seems like a much more elegant solution, even though it doesn't indicate which line is the culprit, it does give me the data necessary to find the offending line.

Upvotes: 0

jsh
jsh

Reputation: 338

This would be trivial with SQL. You can query Excel using SQL, but I am not sure what the performance would be like. Take a look here and see if SQL would not help.

Upvotes: 0

Related Questions