user6643591

Reputation:

Selecting a particular column using Spark

I have a file in HDFS whose fields are pipe (|) separated, and I am trying to extract the 6th column using Scala. For that I have written the code below:

object WordCount {
  def main(args: Array[String]) {
    val textfile = sc.textFile("/user/cloudera/xxx/xxx")
    val word = textfile.filter(x => x.length > 0).map(_.replaceAll("\\|", ",").trim)
    val keys = word.map(a => a(5))
    keys.saveAsTextFile("/user/cloudera/xxx/Sparktest")
  }
}

But the result I am getting in HDFS is not what I want.

Previously my data was :

MSH|^~\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2
PID|1|xxxxx|xxxx||TEST|Rooney|19761202|M|MR^^M^MR^MD^11|7|0371 HOES LANE^0371

Now my data is :

\
T
I
,
1
N
\
T
I
,
1
N
\
T
I

I want my result to be:

BIN
TEST 

I don't know what I am doing wrong. Please help.

Upvotes: 1

Views: 570

Answers (3)

Sandeep Purohit

Reputation: 3692

In Spark 2.0 you now have a built-in CSV reader, so you can simply load the CSV as follows:

val baseDS = spark.read.option("header", "true").csv(filePath)
baseDS.show()

You can then select a column simply by its name:

val selectCol = baseDS.select("columnName")
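
Note that the sample data in the question is pipe-delimited and has no header row, so for that file you would set the separator explicitly and refer to the sixth column by its generated name _c5 (a minimal sketch, assuming the layout shown in the question):

val baseDS = spark.read
  .option("sep", "|")        // the sample fields are separated by |
  .option("header", "false") // the HL7-style sample has no header row
  .csv("/user/cloudera/xxx/xxx")

// without a header, Spark names the columns _c0, _c1, ..., so the 6th field is _c5
val keys = baseDS.select("_c5")
keys.show()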

Upvotes: 0

Tzach Zohar

Reputation: 37832

You're replacing | with ,, but you're not splitting on the comma, so word still has type RDD[String], and not the RDD[Array[String]] you seem to expect. Then a => a(5) treats each string as a sequence of characters, hence the result you're seeing.

Not sure why you'd replace the pipes with commas in the first place; you can just split on the pipe directly:

val word = textfile.filter(x => x.length > 0).map(_.split('|'))
val keys = word.map(a => a(5).trim)
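
Putting it together, a minimal self-contained version of the job might look like this (the paths are the placeholders from the question; the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // sc is only predefined in spark-shell; a standalone job must create it
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    val keys = sc.textFile("/user/cloudera/xxx/xxx")
      .filter(_.nonEmpty)     // drop blank lines
      .map(_.split('|'))      // split on the pipe, yielding Array[String]
      .filter(_.length > 5)   // skip lines with fewer than 6 fields
      .map(a => a(5).trim)    // take the 6th field (index 5)

    keys.saveAsTextFile("/user/cloudera/xxx/Sparktest")
    sc.stop()
  }
}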

Upvotes: 2

wmoco_6725

Reputation: 3169

Use the 'split()' function!

val s="MSH|^~\\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2"

// WRONG
s.replaceAll("\\|",",")(5)   
res3: Char = ~

// RIGHT
s.split("\\|")(5) 
res4: String = BIN
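
One caveat: the String overload of split() takes a regular expression, and | is a regex metacharacter, so an unescaped pipe matches the empty string and splits between every character:

// ALSO WRONG - bare "|" is a regex alternation of two empty patterns
s.split("|").take(6)
res5: Array[String] = Array(M, S, H, |, ^, ~)

// the Char overload avoids the regex entirely
s.split('|')(5)
res6: String = BIN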

Upvotes: 0
