marcamillion

Reputation: 33755

How do I remove duplicate rows in my CSV?

I have a CSV that has data like this:

A.A.B. Direct   http://www.aabdirect.com    348 Willis Ave  Mineola NY  11501   (800) 382-1002  no email
Abeam Consulting Inc    http://abeam.com    245 Park Ave    New York    NY  10167   (212) 372-8783  no email
Abeam Consulting Inc    http://abeam.com    245 Park Ave    New York    NY  10167   (212) 372-8783  no email
Alvarez & Marsal    http://www.alvarezandmarsal.com 600 Madison Ave New York    NY  10022   (212) 759-4433  no email
Alvarez & Marsal    http://www.alvarezandmarsal.com 600 Lexington Ave Ste 6 New York    NY  10022   (212) 759-4433  no email

Sometimes every column matches between two rows (as with Abeam Consulting Inc), but that's not always the case. Sometimes only the website, the phone number, or the name matches.

The website is what matters. If two rows have the same website, I only want to keep one of them.

How do I de-dupe this list in a non N+1 way?

Preferably with a native Ruby method like .uniq or something similar.

Upvotes: 0

Views: 383

Answers (1)

Cary Swoveland

Reputation: 110685

Just read those strings (which I've simplified to avoid the need for horizontal scrolling) into an array:

arr = [
  "A.A.B. Direct   http://www.aabdirect.com    (800) 382-1002",
  "Abeam Consulting Inc    http://abeam.com    (212) 372-8783",
  "Abeam Consulting Inc    http://abeam.com    (212) 372-8783",
  "Alvarez & Marsal    http://www.alvarezandmarsal.com (212) 759-4433",
  "Alvarez & Marsal    http://www.alvarezandmarsal.com 10022   (212) 759-4433"
]

and, as you suggest, use Array#uniq, but with a block:

arr.uniq { |line| line[/\shttp:\S+/] }
  #=> ["A.A.B. Direct   http://www.aabdirect.com    (800) 382-1002",
  #    "Abeam Consulting Inc    http://abeam.com    (212) 372-8783",
  #    "Alvarez & Marsal    http://www.alvarezandmarsal.com (212) 759-4433"]

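See Array#uniq. When given a block, uniq compares the block's return values and keeps the first element for each distinct value. The regex /\shttp:\S+/ reads: match a whitespace character, followed by the string "http:", followed by one or more non-whitespace characters (greedily). Note that this pattern does not match https: URLs; /\shttps?:\S+/ would cover both.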
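If the file is actually tab-separated, the same idea works on parsed fields instead of a regex. The following is only a sketch under assumptions that are not in the question: the fields are separated by tabs, the website is the second field, and the sample rows and the website_column name are made up for illustration.

```ruby
# Hypothetical rows; assumes tab-separated fields with the website in the second field.
lines = [
  "A.A.B. Direct\thttp://www.aabdirect.com\t(800) 382-1002",
  "Abeam Consulting Inc\thttp://abeam.com\t(212) 372-8783",
  "Abeam Consulting Inc\thttp://abeam.com\t(212) 372-8783",
  "Alvarez & Marsal\thttp://www.alvarezandmarsal.com\t(212) 759-4433",
  "Alvarez & Marsal\thttp://www.alvarezandmarsal.com\t10022\t(212) 759-4433"
]

# Zero-based index of the website field; an assumption about the file layout.
website_column = 1

# Split each line into fields, then keep the first row seen for each website.
rows = lines.map { |line| line.split("\t") }
deduped = rows.uniq { |fields| fields[website_column] }

# Print the surviving rows in their original tab-separated form.
deduped.each { |fields| puts fields.join("\t") }
```

As with the regex version, uniq makes a single pass and keeps the first row for each website, so the order of the input decides which duplicate survives. Rows that have no website field all share a nil key and would collapse into one row, so they may need separate handling.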

Upvotes: 2
