Reputation: 33755
I have a CSV that has data like this:
A.A.B. Direct http://www.aabdirect.com 348 Willis Ave Mineola NY 11501 (800) 382-1002 no email
Abeam Consulting Inc http://abeam.com 245 Park Ave New York NY 10167 (212) 372-8783 no email
Abeam Consulting Inc http://abeam.com 245 Park Ave New York NY 10167 (212) 372-8783 no email
Alvarez & Marsal http://www.alvarezandmarsal.com 600 Madison Ave New York NY 10022 (212) 759-4433 no email
Alvarez & Marsal http://www.alvarezandmarsal.com 600 Lexington Ave Ste 6 New York NY 10022 (212) 759-4433 no email
The tricky part is that sometimes all columns in two rows match (like Abeam Consulting Inc), but sometimes that's not the case. Sometimes only the website matches, or only the phone number, or only the name.
The column that matters is the website. If two rows have the same website, I only want to keep one of them.
How do I de-dupe this list in a non N+1 way?
Preferably with some native Ruby method like .uniq or something of the sort.
Upvotes: 0
Views: 383
Reputation: 110685
Just read those strings (which I've simplified to avoid the need for horizontal scrolling) into an array:
arr = [
"A.A.B. Direct http://www.aabdirect.com (800) 382-1002",
"Abeam Consulting Inc http://abeam.com (212) 372-8783",
"Abeam Consulting Inc http://abeam.com (212) 372-8783",
"Alvarez & Marsal http://www.alvarezandmarsal.com (212) 759-4433",
"Alvarez & Marsal http://www.alvarezandmarsal.com 10022 (212) 759-4433"
]
and, as you suggest, use Array#uniq, but with a block:
arr.uniq { |line| line[/\shttp:\S+/] }
#=> ["A.A.B. Direct http://www.aabdirect.com (800) 382-1002",
# "Abeam Consulting Inc http://abeam.com (212) 372-8783",
# "Alvarez & Marsal http://www.alvarezandmarsal.com (212) 759-4433"]
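See Array#uniq. The regex /\shttp:\S+/ reads, "match a whitespace character followed by the string 'http:', followed by one or more non-whitespace characters (greedily)". Note that it matches only http: URLs, not https: ones. When a block is given, uniq keeps the first element for each distinct block value.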
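If the real data mixes http and https, or writes the same site with and without "www." or a trailing slash, the block can return a normalized key instead of the raw match. The following is only a sketch under those assumptions: website_key is a hypothetical helper name, and the third sample line is made up to show the normalization (it is not from the question's data). Lines with no URL fall back to the whole line as their key, so they are not collapsed into one.

```ruby
# Hypothetical helper (not part of Ruby): builds a comparison key from the
# first URL found in a line. Returns nil when the line contains no URL.
def website_key(line)
  url = line[%r{https?://\S+}i]
  return nil unless url

  url.downcase
     .sub(%r{\Ahttps?://}, "") # ignore the scheme
     .sub(/\Awww\./, "")       # treat www.example.com like example.com
     .chomp("/")               # ignore a single trailing slash
end

arr = [
  "A.A.B. Direct http://www.aabdirect.com (800) 382-1002",
  "Abeam Consulting Inc http://abeam.com (212) 372-8783",
  # Made-up variant of the same site, for illustration only.
  "Abeam Consulting Inc https://www.abeam.com/ (212) 372-8783",
  "Alvarez & Marsal http://www.alvarezandmarsal.com (212) 759-4433"
]

# Array#uniq with a block keeps the first element for each distinct key.
# Lines without a URL use the line itself as the key.
deduped = arr.uniq { |line| website_key(line) || line }
```

Here deduped holds three lines, with the first Abeam entry kept. How aggressively to normalize (for example, whether paths or subdomains should count as different sites) depends on the data, so treat the rules in website_key as a starting point rather than a fixed recipe.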
Upvotes: 2