Reputation: 367
I'm totally new to Apache Spark and Scala, and I'm having problems mapping a .csv file into a key-value structure (like JSON).
What I want to accomplish is to get the .csv file:
user, timestamp, event
ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:52:56,USER_PURCHASED
ad0e431a69cb3b445ddad7bb97f55665,2015-03-06 13:52:57,USER_SHARED
83b2d8a2c549fbab0713765532b63b54,2015-03-06 13:52:57,USER_SUBSCRIBED
ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST
...
Into a structure like:
ec79fcac8c76ebe505b76090f03350a2: [(2015-03-06 13:52:56,USER_PURCHASED), (2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST)]
ad0e431a69cb3b445ddad7bb97f55665: [(2015-03-06 13:52:57,USER_SHARED)]
83b2d8a2c549fbab0713765532b63b54: [(2015-03-06 13:52:57,USER_SUBSCRIBED)]
...
How can this be done if the file is read by:
val csv = sc.textFile("file.csv")
Help is very much appreciated!
Upvotes: 1
Views: 2244
Reputation: 5999
Something like:
case class MyClass(user: String, date: String, event: String)

def csvToMyClass(line: String) = {
  val split = line.split(',')
  // This is a good place to do validations
  // and convert strings to numbers, enums, UUIDs, etc.
  // (You may also want to filter out the header row first.)
  MyClass(split(0), split(1), split(2))
}

val csv = sc.textFile("file.csv")
  .map(csvToMyClass)
Of course, do a little more work to have more concrete data types on your class rather than just strings...
This is for reading the CSV file into a structure (which seems to be your main question). If you then need to merge all the data for a single user, you can instead map each line to a key/value tuple (String -> (String, String)) and use .aggregateByKey() to join all tuples for a user. Your aggregation function can then return whatever structure you want.
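To illustrate the key/value approach, here is a minimal local sketch (plain Scala collections instead of an RDD, with a few sample lines from the question; Spark's .aggregateByKey() performs the same per-key merging, just distributed):

```scala
// Sample lines mirroring the question's CSV (header row already dropped).
val lines = List(
  "ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:52:56,USER_PURCHASED",
  "ad0e431a69cb3b445ddad7bb97f55665,2015-03-06 13:52:57,USER_SHARED",
  "ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST"
)

// Map each line to a (user, (timestamp, event)) tuple...
val pairs: List[(String, (String, String))] = lines.map { line =>
  val Array(user, ts, event) = line.split(',')
  (user, (ts, event))
}

// ...then merge all tuples per user. On an RDD you would use
// aggregateByKey(zero)(seqOp, combOp); groupBy + mapValues is the
// local equivalent and yields the structure from the question.
val byUser: Map[String, List[(String, String)]] =
  pairs.groupBy(_._1).map { case (user, kvs) => (user, kvs.map(_._2)) }

byUser.foreach { case (user, events) => println(s"$user: $events") }
```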
Upvotes: 1
Reputation: 520
Daniel is right.
Afterwards, on the RDD of MyClass from his answer, you just have to do:
csv.keyBy(_.user).groupByKey
And that's all.
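For reference, a local sketch of that keyBy/groupByKey step, using the case class from Daniel's answer (in Spark the result would be an RDD[(String, Iterable[MyClass])]; here plain Scala's groupBy does both steps at once):

```scala
// Case class from Daniel's answer.
case class MyClass(user: String, date: String, event: String)

// A few records as they would look after .map(csvToMyClass).
val records = List(
  MyClass("ec79fcac8c76ebe505b76090f03350a2", "2015-03-06 13:52:56", "USER_PURCHASED"),
  MyClass("ec79fcac8c76ebe505b76090f03350a2", "2015-03-06 13:53:01", "USER_ADDED_TO_PLAYLIST"),
  MyClass("83b2d8a2c549fbab0713765532b63b54", "2015-03-06 13:52:57", "USER_SUBSCRIBED")
)

// keyBy(_.user) pairs each record with its user key; groupByKey then
// collects all records sharing that key. Locally, groupBy is equivalent.
val grouped: Map[String, List[MyClass]] = records.groupBy(_.user)
```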
Upvotes: 0