sr1987

Reputation: 23

Split a file into multiple files based on a string in Spark Scala

I have a text file with the data below, which has no particular format:

abc*123     *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~ 
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~

I want to split the file on the string abc, producing two output files as below.

file 1:

abc*123     *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~ 
hig*0109*10052200*Rq~

file 2:

abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~

The file names should come from the IT name on the line that starts with k7, so the first file should be named IT_1234 and the second file IT_8876.

Upvotes: 1

Views: 2713

Answers (2)

Ramesh Maharjan

Reputation: 41957

You can benefit from sparkContext's wholeTextFiles function to read the whole file as one string, and then parse it to separate the records (here I have used #### as a distinct combination of characters that won't occur in the text):

// prefix every "abc" that starts a new line with ####, then split on that marker
val rdd = sc.wholeTextFiles("path to the file")
  .flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####"))
  .collect()

Then loop over the collected array and save each chunk to its own output file:

for(str <- rdd){
  //saving codes here
}
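For the body of that loop, here is a minimal sketch, assuming the chunks fit in driver memory, that the file name can be pulled from the k7 line with a regex, and that writing local files with java.io is acceptable; the output directory name is an illustrative choice.

import java.io.{File, PrintWriter}

val outputDir = new File("output")        // hypothetical output directory
outputDir.mkdirs()
val itPattern = """k7\*(IT \d+)\*""".r    // captures e.g. "IT 1234" from the k7 line

for (str <- rdd) {
  // file name comes from the IT token, e.g. "IT 1234" -> IT_1234
  val fileName = itPattern.findFirstMatchIn(str)
    .map(_.group(1).replace(" ", "_"))
    .getOrElse("UNKNOWN")
  val writer = new PrintWriter(new File(outputDir, fileName))
  try writer.write(str) finally writer.close()
}

Writing from the driver like this is fine for an input of this size; for larger data you would key the chunks by file name and let the executors do the writing.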

Upvotes: 0

tricky

Reputation: 1553

There is a little dirty trick that I used for a project:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")

You can set the record delimiter that your SparkContext uses when reading files, so you could do something like this:

val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
           .map(x => (delimit ++ x))
           .toDF("delimit_column")
           .filter(col("delimit_column") !== delimit)

Then you can take each element of your DataFrame (or RDD) and write it out to its own file, as sketched below.
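A rough sketch of that write step, again assuming the records fit in driver memory and that the file name is taken from the k7 / IT token with a regex (the output path is illustrative):

import java.io.{File, PrintWriter}

val itPattern = """k7\*(IT \d+)\*""".r

df.collect().foreach { row =>
  val record = row.getString(0)
  // e.g. "IT 8876" -> IT_8876; assumes the output directory already exists
  val name = itPattern.findFirstMatchIn(record)
    .map(_.group(1).replace(" ", "_"))
    .getOrElse("UNKNOWN")
  val writer = new PrintWriter(new File(s"output/$name"))
  try writer.write(record) finally writer.close()
}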

It's a dirty method, but it might help you!

Have a good day

PS: The filter at the end drops the very first record, which is empty apart from the concatenated delimiter.

Upvotes: 3
