Reputation: 873
I have a file which contains a name on each line, and I want to add sequence numbers to each line. For example, if a file is like this
a
b
c
d
I want to turn it into this
a,1
b,2
c,3
d,4
I have written this code to achieve this:
val lines = sc.textFile("data.txt")
// zipWithIndex pairs each element with its global index, starting at 0
val pair = lines.zipWithIndex().map { case (line, index) => line + "," + index }
pair.collect()
But as you know, Spark distributes its tasks across the cluster, so I am not sure this will work. Can anyone please tell me how I can achieve this? Thanks in advance.
Upvotes: 2
Views: 967
Reputation: 1006
If you run this code you will get the output you are expecting. Spark does distribute its tasks across the cluster, but that does not change the result of the program. In your example, if you run with 2 worker nodes, the file will be divided into two partitions, one stored on each worker. When the driver comes across zipWithIndex, it first runs a small job to count the elements in each partition, so every partition knows the starting index for its own elements and the numbering stays globally consistent.
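A minimal sketch you can try in the Scala spark-shell (assuming a SparkContext named sc): it forces two partitions, mimicking two worker nodes, and uses glom to show which indices end up in which partition.

val lines = sc.parallelize(Seq("a", "b", "c", "d"), 2)   // 2 partitions, like 2 worker nodes
val indexed = lines.zipWithIndex()                       // (element, global index) pairs
indexed.glom().collect().foreach(part => println(part.mkString(" ")))
// partition 0: (a,0) (b,1)
// partition 1: (c,2) (d,3)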
In Spark, different transformations and actions have different requirements, and the driver makes sure those requirements are fulfilled; for example, distinct needs to shuffle the data to make sure only one copy of each element remains.
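A quick illustration of that difference, under the same spark-shell assumptions as above:

sc.parallelize(Seq("a", "b", "a", "c"), 2).distinct().collect()
// duplicates are removed after a shuffle; each value appears once, order not guaranteed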
Another thing: if you only want to pair each word with its line number, you do not need the map. This alone will also work:
JavaPairRDD<String, Long> pair = lines.zipWithIndex();
I ran the example in Java with the above line of code, without the map, and it gave me the correct output, although the line numbers started at 0. Still, it proves the point that the number of worker nodes has no effect on the line numbers coming out in order.
// output of worker 1 part-00000
a,0
b,1
//output of worker 2 part-00001
c,2
d,3
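If you do want the 1-based, comma-separated lines from the question, keep the map and shift the index by one. A minimal Scala sketch, assuming the same sc and data.txt as in the question ("numbered" below is just a hypothetical output path):

val pair = sc.textFile("data.txt")
  .zipWithIndex()
  .map { case (line, index) => line + "," + (index + 1) }
pair.saveAsTextFile("numbered")
// with two partitions as above, part-00000 would hold a,1 and b,2, and part-00001 would hold c,3 and d,4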
Upvotes: 3