Every row in my csv file is structured like this:
u001, 2013-11, 0, 1, 2, ... , 99
in which u001 and 2013-11 are the UID and date, and the numbers from 0 to 99 are the data values. I want to load this csv file into a Spark DataFrame with this structure:
+-------+-------------+-----------------+
| uid| date| dataVector|
+-------+-------------+-----------------+
| u001| 2013-11| [0,1,...,98,99]|
| u002| 2013-11| [1,2,...,99,100]|
+-------+-------------+-----------------+
root
|-- uid: string (nullable = true)
|-- date: string (nullable = true)
 |-- dataVector: array (nullable = true)
| |-- element: integer (containsNull = true)
in which dataVector is an Array[Int], and the dataVector length is the same for every UID and date. I have tried several ways to solve this, including:
Using a schema
import org.apache.spark.sql.types._

val attributes = Array("uid", "date", "dataVector")
val schema = StructType(
  StructField(attributes(0), StringType, true) ::
  StructField(attributes(1), StringType, true) ::
  StructField(attributes(2), ArrayType(IntegerType), true) ::
  Nil)
But this way didn't work. And since my later dataset has more than 100 data columns, I think it is also inconvenient to create a schema that lists every dataVector column manually.
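As a side note, a flat schema with that many fields can at least be built programmatically rather than written out by hand. A minimal sketch, assuming generated names like v0..v99 for the data columns; note this still reads each value as a separate integer column, not as one array:
import org.apache.spark.sql.types._

// Generate the ~100 data fields in a loop instead of listing them manually
val dataFields = (0 until 100).map(i => StructField(s"v$i", IntegerType, nullable = true))
val schema = StructType(
  Seq(
    StructField("uid", StringType, nullable = true),
    StructField("date", StringType, nullable = true)
  ) ++ dataFields
)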
Directly loading the csv file without a schema, and using the method from concatenate multiple columns into single columns to combine the data columns, but the resulting schema looks like
root
|-- uid: string (nullable = true)
|-- date: string (nullable = true)
|-- dataVector: struct (nullable = true)
| |-- _c3: string (nullable = true)
| |-- _c4: string (nullable = true)
...
| |-- _c101: string (nullable = true)
This is still different from what I need, and I haven't found a way to convert this struct into the structure I want. So my question is: how can I load the csv file into that structure?
Upvotes: 5
Views: 2301
Load it without any additions
val df = spark.read.csv(path)
and select:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Combine the data columns into a single array column
val dataVector: Column = array(
  df.columns.drop(2).map(col): _*  // skip the first 2 columns (uid, date)
).cast("array<int>")               // cast to the required element type

// Keep the first two columns and append the array column
val cols: Array[Column] = df.columns.take(2).map(col) :+ dataVector

df.select(cols: _*).toDF("uid", "date", "dataVector")
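As a quick sanity check (a sketch only; result is just a name introduced here), printing the schema of the selected DataFrame should show dataVector as an array of integers:
val result = df.select(cols: _*).toDF("uid", "date", "dataVector")
result.printSchema()              // dataVector should appear as array<int>
result.show(2, truncate = false)  // inspect the first rows without truncation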
Upvotes: 3