Reputation: 817
I have a text file delimited with ~, and I need to perform some parsing before converting it to a DataFrame. The code reads the text file in as an RDD[String], does some parsing, converts it to an RDD[Row], and then creates a DataFrame from the schema.
Below is the code I currently have. It works, but the problem is that the actual schema is 400 fields long. I was wondering if there is a simpler way than typing attributes(1), attributes(2), attributes(3)... and so on.
I am currently on Spark 1.6 (CDH 5.2.2).
Example Input:
20161481132310 ~ ~"This" is a comma 10
Current Code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema_1 = StructType(Array(
  StructField("EXAMPLE_1", StringType, true),
  StructField("EXAMPLE_2", StringType, true),
  StructField("EXAMPLE_3", StringType, true)))

// Read the raw file and trim the whitespace around each ~-delimited field
val rdd = sc.textFile("example.txt")
val rdd_truncate = rdd.map(_.split("~").map(_.trim).mkString("~"))

// Split again and build each Row field by field; this is the part that
// does not scale to 400 columns
val row_final = rdd_truncate
  .map(_.split("~"))
  .map(attributes => Row(attributes(0),
    attributes(1),
    attributes(2)))

val df = sqlContext.createDataFrame(row_final, schema_1)
Based on the suggestion, I modified my code to the following. It works except on quotes: the "This" in the example input will fail. Any suggestions?
import org.apache.spark.sql.functions.{col, trim}

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "~")
  .schema(schema)
  .load("example.txt")

// Trim whitespace from every column
val df_final = df.select(df.columns.map(c => trim(col(c)).alias(c)): _*)
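One possible workaround for the quote issue, since spark-csv treats " as the quote character by default, is to override the quote option with a character that never appears in the data, which effectively disables quote handling. A sketch, untested, assuming \u0000 does not occur in the file:

// Sketch: point the quote option at \u0000 (assumed absent from the data)
// so embedded quotes such as "This" are read as plain text
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "~")
  .option("quote", "\u0000")
  .schema(schema)
  .load("example.txt")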
Upvotes: 0
Views: 292
Reputation: 330063
Just use the standard CSV reader:
spark.read.schema(schema).option("delimiter", "~").csv("example.txt")
If you want to trim fields, just use select:
import org.apache.spark.sql.functions.{col, trim}
df.select(df.columns.map(c => trim(col(c)).alias(c)): _*)
If you use Spark 1.x you can use spark-csv:
sqlContext.read
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter", "~")
.load("example.txt")
If this is for some reason not sufficient you can use Row.fromSeq:
Row.fromSeq(line.split("~").take(3))
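Putting this together for the 400-field case, both the schema and the rows can be built programmatically, so nothing has to be typed out by hand. A minimal sketch, assuming hypothetical EXAMPLE_i column names (substitute your real field list):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical names; in practice load the real 400 field names from somewhere
val schema = StructType((1 to 400).map(i => StructField(s"EXAMPLE_$i", StringType, true)))

val rows = sc.textFile("example.txt")
  .map(_.split("~", -1).map(_.trim))   // limit -1 keeps trailing empty fields
  .map(fields => Row.fromSeq(fields))

val df = sqlContext.createDataFrame(rows, schema)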
Upvotes: 3