Satya

Reputation: 153

How to process tab-separated files in Spark?

I have a tab-separated file. The third column should be my key and the entire record should be my value (as per the MapReduce concept).

val cefFile = sc.textFile("C:\\text1.txt")
// Keep only records that start with "1" (not used further below).
val cefDim1 = cefFile.filter { line => line.startsWith("1") }
// Split each line on tabs.
val joinedRDD = cefFile.map(x => x.split("\\t"))
joinedRDD.first().foreach { println }

I am able to get the value of the first column but not the third. Can anyone suggest how I could accomplish this?

Upvotes: 2

Views: 7522

Answers (1)

Holden

Reputation: 7452

After you've done the split with x.split("\\t"), your RDD (which in your example you called joinedRDD, but which I'm going to call parsedRDD since we haven't joined it with anything yet) is going to be an RDD of arrays. We can turn this into an RDD of key/value tuples with parsedRDD.map(r => (r(2), r)). That being said, you aren't limited to just map and reduce operations in Spark, so it's possible that another data structure might be better suited. Also, for tab-separated files you could use spark-csv along with Spark DataFrames, if that's a good fit for the eventual problem you are looking to solve.
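Here's a minimal sketch putting that together, assuming Spark 1.x with a local SparkContext; the file path comes from your question, and the countByKey step at the end is just one illustration of what you can do once the records are keyed:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("TabSeparatedKeys").setMaster("local[*]")
val sc = new SparkContext(conf)

// Split each line on tabs, giving an RDD[Array[String]].
val cefFile = sc.textFile("C:\\text1.txt")
val parsedRDD = cefFile.map(x => x.split("\\t"))

// Key each record by its third column (index 2), keeping the full
// record as the value -- the classic MapReduce (key, value) shape.
val keyedRDD = parsedRDD.map(r => (r(2), r))

// For example, count how many records share each third-column key.
keyedRDD.countByKey().foreach(println)

If you go the DataFrame route instead, the spark-csv package (assuming it's on your classpath) lets you set the delimiter to a tab:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .load("C:\\text1.txt")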

Upvotes: 2
