Reputation: 41
I am new to Scala and used to work with Python.
I want to convert a program from Python to Scala and am having difficulty with the following lines (which create a SQL DataFrame):
from pyspark.sql.types import StructField, StringType, StructType

# one nullable StringType column per name in schemaString
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Python 2 tuple-unpacking lambda: (file name, text, parent directory)
data = dataset.map(lambda (filepath, text): (filepath.split("/")[-1], text, filepath.split("/")[-2]))
df = sqlContext.createDataFrame(data, schema)
I have made this so far:
val category = dataset.map { case (filepath, text) => filepath.split("/")(6) }
val id = dataset.map { case (filepath, text) => filepath.split("/")(7) }
val text = dataset.map { case (filepath, text) => text }
val schema = StructType(Seq(
  StructField(id.toString(), StringType, true),
  StructField(category.toString(), StringType, true),
  StructField(text.toString(), StringType, true)
))
and now I am blocked there!
Upvotes: 3
Views: 10897
Reputation: 348
For what it is worth, I have converted your code literally, and the following compiles using Spark 2.3.2 on my machine:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// spark is the SparkSession, in scope by default in spark-shell
import spark.implicits._
// Introduced to make code clearer
case class FileRecord(name: String, text: String)
// Whatever dataset you have (a single hard-coded record; replace it with your data)
val dataSet = Seq(FileRecord("/a/b/c/d/e/f/g/h/i", "example contents")).toDS()
// Path segments 6 and 7 are hard coded below (you might want to change this);
// the following three map operations could probably be done more efficiently in a single pass
val category = dataSet.map { case FileRecord(filepath, text) => filepath.split("/")(6) }
val id = dataSet.map { case FileRecord(filepath, text) => filepath.split("/")(7) }
val text = dataSet.map { case FileRecord(filepath, text) => text }
val schema = StructType(Seq(
  StructField(id.toString(), StringType, true),
  StructField(category.toString(), StringType, true),
  StructField(text.toString(), StringType, true)
))
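Note that id, category, and text above are Datasets, so id.toString() compiles but produces the Dataset's string representation (something like "[value: string]") rather than a real column name, which is probably not what the original Python intended. A closer structural match to the Python, building the schema from a space-separated string of field names and creating the DataFrame from an RDD of Rows, might look like the sketch below; the column names in schemaString are assumptions, and dataSet and FileRecord are reused from above:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assumed column names; substitute whatever your schemaString actually contains
val schemaString = "name text category"
val rowSchema = StructType(schemaString.split(" ").map(name => StructField(name, StringType, nullable = true)))

// Single pass mirroring the Python lambda: (file name, text, parent directory)
val rows = dataSet.rdd.map { case FileRecord(filepath, text) =>
  val parts = filepath.split("/")
  Row(parts.last, text, parts(parts.length - 2))
}

val df = spark.createDataFrame(rows, rowSchema)
With the hard-coded record above, df.show() would print a single row ("i", "example contents", "h"). Going through the RDD and Row mirrors sqlContext.createDataFrame(data, schema) from the Python and avoids needing an Encoder[Row] on the Dataset API.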
Upvotes: 4