eozzy

Reputation: 68710

Finding Duplicates (Regex)

I have a CSV containing a list of 500 members with their phone numbers. I tried diff tools, but none of them seem to find duplicates.

Can I use regex to find duplicate rows by members' phone numbers?

I'm using TextMate on a Mac.

Many thanks

Upvotes: 2

Views: 2765

Answers (5)

eumiro

Reputation: 213005

What duplicates are you searching for? The whole lines or just the same phone number?

If it is the whole line, then try this:

sort phonelist.txt | uniq -c | sort -n

and at the bottom you will see all the lines that occur more than once.

If it is just the phone number in some column, then use this:

awk -F ';' '{print $4}' phonelist.txt | sort | uniq -c | sort -n

Replace the 4 with the number of the column that holds the phone number, and the ';' with the separator your file actually uses. (Note the sort before uniq -c: uniq only collapses adjacent duplicates, so the column has to be sorted first.)

Or give us a few example lines from this file.

EDIT:

If the data format is name,mobile,phone,uniqueid,group, then use the following on the command line:

awk -F ',' '{print $3}' phonelist.txt | sort | uniq -c | sort -n

Upvotes: 4

Ruel

Reputation: 15780

Use Perl.

Load the CSV file into an array and extract the column you want to check (the phone numbers) into a second array, then filter out the duplicates in that array using:

my %seen;
# grep keeps a value only while its count in %seen is still zero,
# i.e. the first occurrence of each phone number.
my @unique = grep !$seen{$_}++, @array2;

After that, loop over the unique array (the phone numbers), and inside that loop iterate over the original array of lines; whenever a line's phone number matches, output that line to another CSV file. A sketch of the whole approach follows below.
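Here is a minimal sketch of that idea, folded into a single pass with the same %seen idiom. The file names, the comma separator, and the third-column position of the phone number are all assumptions; adjust them for your file:

#!/usr/bin/perl
use strict;
use warnings;

open my $in,  '<', 'phonelist.csv'  or die "Cannot open input: $!";
open my $out, '>', 'duplicates.csv' or die "Cannot open output: $!";

my %seen;
while (my $line = <$in>) {
    chomp $line;
    # Split on commas; the phone number is assumed to be the third field.
    my @fields = split /,/, $line;
    my $phone  = $fields[2];
    # Write out every line whose phone number has already been seen.
    print $out "$line\n" if $seen{$phone}++;
}

close $in;
close $out;

Note this writes only the second and later occurrences of each number; if you also want the first occurrence, stash each first line in %seen instead of a plain counter.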

Upvotes: 0

Robusto

Reputation: 31883

Yes. For one way to do it, look here. But you would probably not want to do it this way.

Upvotes: 2

Ryan Rodemoyer

Reputation: 5692

What language are you using? In .NET, you could load the CSV file into a DataTable with little effort, find and remove the duplicate rows, and then write the DataTable back to another CSV file.

Heck, you can load this file into Excel, sort by a field, and find the duplicates manually. 500 rows isn't THAT many.

Upvotes: 0

Svisstack

Reputation: 16636

You can simply parse this file and check which rows are duplicated. I think regex is the worst solution for this problem.
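For example, the parsing and the duplicate check fit in a single Perl one-liner (assuming comma-separated fields with the phone number in the third column; adjust -F and the field index for your file):

perl -F, -lane 'print if $seen{$F[2]}++' phonelist.csv

This prints every row whose phone number already appeared on an earlier row.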

Upvotes: 0
