Reputation:
I have a file in HDFS that is pipe (|) separated. I am trying to extract the 6th column using Scala, and for that I have written the code below:
object WordCount {
  def main(args: Array[String]): Unit = {
    val textfile = sc.textFile("/user/cloudera/xxx/xxx")
    val word = textfile.filter(x => x.length > 0).map(_.replaceAll("\\|", ",").trim)
    val keys = word.map(a => a(5))
    keys.saveAsTextFile("/user/cloudera/xxx/Sparktest")
  }
}
But the result I am getting in HDFS is not what I want.
My input data is:
MSH|^~\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2
PID|1|xxxxx|xxxx||TEST|Rooney|19761202|M|MR^^M^MR^MD^11|7|0371 HOES LANE^0371
Now the result I get is:
\
T
I
,
1
N
\
T
I
,
1
N
\
T
I
I want my result to be:
BIN
TEST
I don't know what I am doing wrong. Please help.
Upvotes: 1
Views: 570
Reputation: 3692
In Spark 2.0+ there is a built-in CSV reader, so you can simply load the file as follows:
val baseDS = spark.read.option("header", "true").csv(filePath)
baseDS.show()
You can select a column simply by its name, as follows:
val selectCol = baseDS.select("columnName") // replace with the actual column name
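Since the sample rows are pipe-delimited and have no header line, a minimal sketch for this particular data might look like the following (assuming Spark 2.0+, with `spark` an existing `SparkSession` and `filePath` a hypothetical HDFS path):

```scala
// Sketch only: spark is an existing SparkSession, filePath is a placeholder
val baseDS = spark.read
  .option("sep", "|")        // the sample rows are pipe-separated, not comma-separated
  .option("header", "false") // the MSH/PID rows have no header line
  .csv(filePath)

// Without a header, Spark names the columns _c0, _c1, ...,
// so the 6th column is _c5
val selectCol = baseDS.select("_c5")
selectCol.show()
```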
Upvotes: 0
Reputation: 37832
You're replacing `|` with `,`, but you're not splitting by comma, so `word` still has type `RDD[String]` and not `RDD[Array[String]]` as you seem to expect. Then `a => a(5)` indexes each string as a sequence of characters, which produces the result you're seeing.
Not sure why you'd replace the pipes with commas in the first place; you can just:
val word = textfile.filter(x => x.length > 0).map(_.split('|'))
val keys = word.map(a => a(5).trim)
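As a quick sanity check on the sample MSH row (plain Scala, no Spark required), splitting on the pipe and taking index 5 yields the expected field:

```scala
object SplitCheck {
  def main(args: Array[String]): Unit = {
    val line = "MSH|^~\\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2"
    // split('|') produces an Array[String]; index 5 is the 6th field
    val fields = line.split('|')
    println(fields(5)) // prints "BIN"
  }
}
```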
Upvotes: 2
Reputation: 3169
Use the `split()` function!
val s="MSH|^~\\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2"
// WRONG
s.replaceAll("\\|",",")(5)
res3: Char = ~
// RIGHT
s.split("\\|")(5)
res4: String = BIN
Upvotes: 0