Reputation: 1596
I'm parsing a CSV file that contains no newline characters:
"line1field1", "line1field2", "line1field3", "line2field1", "line2field2", "line2field3", "line3field1", "line3field2", "line3field3"
Is it possible to do this efficiently in Spark? (I would like to get a Dataset with 3 rows and 3 fields in each.)
Upvotes: 1
Views: 580
Reputation: 91
If you want to do it in a Spark-oriented way, this should work. I loaded the CSV file from the resources folder, but substitute whatever path the file lives at for you.
import sqlContext.implicits._
val columnNames: Seq[String] = Seq("Col1", "Col2", "Col3")
sparkContext.textFile(this.getClass.getResource("/test.csv").toString) // your file location here
  .map(_.split(','))                       // split the single long line into fields
  .flatMap(_.sliding(3, 3))                // group the fields into rows of 3
  .map(_.toList)
  .map { case List(a, b, c) => (a, b, c) } // cleanup needed here to convert to a Tuple
  .toDF(columnNames: _*)
  .show(truncate = false)
This yields:
+-----------+-----------+-----------+
|Col1 |Col2 |Col3 |
+-----------+-----------+-----------+
|line1field1|line1field2|line1field3|
|line2field1|line2field2|line2field3|
|line3field1|line3field2|line3field3|
+-----------+-----------+-----------+
Changing the sliding parameters to match your column count will work for rows of other widths. You will need to change the tuple mapping as well, though, so this probably is not feasible with a large number of columns.
You could also look at the List to Tuple answer for how to map lists of unknown size to tuples.
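For wider or variable column counts, one way to avoid the fixed-arity tuple altogether (a sketch, not part of the original answer; the generated Col1..ColN names are for illustration) is to build Row objects and a schema programmatically:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val n = 3 // set to your column count
val schema = StructType((1 to n).map(i => StructField(s"Col$i", StringType, nullable = true)))
val rows = sparkContext
  .textFile(this.getClass.getResource("/test.csv").toString)
  .flatMap(_.split(',').sliding(n, n))      // rows of n fields each; assumes the field count is an exact multiple of n
  .map(fields => Row.fromSeq(fields.toSeq)) // no compile-time tuple arity needed
sqlContext.createDataFrame(rows, schema).show(truncate = false)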
Upvotes: 1
Reputation: 41957
If I understand your question correctly, and you have input data without a line delimiter, such as
"line1field1", "line1field2", "line1field3", "line2field1", "line2field2", "line2field3", "line3field1", "line3field2", "line3field3"
and you want output like
+-------------+-------------+-------------+
|Column1 |Column2 |Column3 |
+-------------+-------------+-------------+
|"line1field1"|"line1field2"|"line1field3"|
|"line2field1"|"line2field2"|"line2field3"|
|"line3field1"|"line3field2"|"line3field3"|
+-------------+-------------+-------------+
then the following code should help you achieve that:
import scala.util.Try
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val data = sc.textFile("path to the input file")
val todf = data
  .map(line => line.split(","))
  .flatMap(array =>
    // step through the fields three at a time; Try pads short rows with ""
    (0 until array.length by 3).map(index =>
      Array(
        Try(array(index)).getOrElse(""),
        Try(array(index + 1)).getOrElse(""),
        Try(array(index + 2)).getOrElse(""))))
  .map(row => Row.fromSeq(Seq(row(0).trim, row(1).trim, row(2).trim)))
val schema = StructType(Array(
  StructField("Column1", StringType, true),
  StructField("Column2", StringType, true),
  StructField("Column3", StringType, true)))
sqlContext.createDataFrame(todf, schema).show(false)
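Since the question asks for a Dataset rather than a DataFrame, a possible last step (a sketch, not part of the original answer; the case class Record is a made-up name) is to convert the result to a typed Dataset:
import sqlContext.implicits._
// hypothetical case class matching the three string columns
case class Record(Column1: String, Column2: String, Column3: String)
val ds = sqlContext.createDataFrame(todf, schema).as[Record] // Dataset[Record]
ds.show(false)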
I hope the answer is helpful.
Upvotes: 1