Scala Nested Iteration within RDD

Question

I have to iterate through all columns to find similarity of 1 column value. For example:

ID,FN,LN,Phone
-----------
1,James,Butt,872-232-1212
2,Josephine,Darakjy, 872-232-1213
3,Art,Venere,872-232-1214
4,Lenna,Paprocki,872-232-1215
5,Donette, Foller,872-232-1216
6,Jmes,Butt,666-232-1212
7,Donette, Foller,888-232-1216
8,Josphne,Darkjy, 555-232-1213

Inside the loop, I will take FN, which is 'James' and see if I have similar name in the complete data set using some kind string distances (e.g Levenshtein) and in this case I have match with ID#6: 'Jmes', I will create a bucket by adding a new GUID column this:

ID,FN,LN,Phone,GrupId
----------------------
1,James,Butt,872-232-1212,G1
2,Josephine,Darakjy, 872-232-1213,G2
3,Art,Venere,872-232-1214,G3
4,Lenna,Paprocki,872-232-1215,G4
5,Donette, Foller,872-232-1216,G5
6,Jmes,Butt,666-232-1212,G1
7,Donette, Foller,888-232-1216,G5
8,Josphne,Darkjy, 555-232-1213,G2

I have to do same operation on multiple columns, like LN, Phone as well. Imagine if I have 1 million records.

Any thoughts, suggestions or links are appreciated. Thank you!

Scala Nested Iteration within RDD

Answers (1)

Related Questions