Reputation: 2611
I have three different ORC files in three different folders, and I want to read them all into one DataFrame in one shot.
user1.orc at /data/user1/
+-------------------+--------------------+
| userid | name |
+-------------------+--------------------+
| 1 | aa |
| 6 | vv |
+-------------------+--------------------+
user2.orc at /data/user2/
+-------------------+--------------------+
| userid | info |
+-------------------+--------------------+
| 11 | i1 |
| 66 | i6 |
+-------------------+--------------------+
user3.orc at /data/user3/
+-------------------+--------------------+
| userid | con |
+-------------------+--------------------+
| 12 | 888 |
| 17 | 123 |
+-------------------+--------------------+
I want to read all of these at once and get a DataFrame like the one below:
+-------------------+--------------------+--------------------+----------+
| userid | name | info | con |
+-------------------+--------------------+--------------------+----------+
| 1 | aa | null | null |
| 6 | vv | null | null |
| 11 | null | i1 | null |
| 66 | null | i6 | null |
| 12 | null | null | 888 |
| 17 | null | null | 123 |
+-------------------+--------------------+--------------------+----------+
So I tried this:
val df = spark.read.option("mergeSchema", "true").orc("file:///home/hadoop/data/")
but it only returns the column that is common to all of the files:
+-------------------+
| userid |
+-------------------+
| 1 |
| 6 |
| 11 |
| 66 |
| 12 |
| 17 |
+-------------------+
So how can I read all three files in one shot and keep every column?
Upvotes: 3
Views: 1754
Reputation: 6739
I have a crude workaround for you, in case you don't find a better solution.
Read each of those files into its own DataFrame, convert the rows to JSON, and then union them, something like this:
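// Read each directory separately and turn every row into a JSON string.
val user1 = sparkSession.read.orc("/home/prasadkhode/data/user1/").toJSON
val user2 = sparkSession.read.orc("/home/prasadkhode/data/user2/").toJSON
val user3 = sparkSession.read.orc("/home/prasadkhode/data/user3/").toJSON
// Union the JSON strings and let the JSON reader infer the merged schema.
val result = sparkSession.read.json(user1.union(user2).union(user3).rdd)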
result.printSchema()
result.show(false)
and the output will be:
root
|-- con: long (nullable = true)
|-- info: string (nullable = true)
|-- name: string (nullable = true)
|-- userId: long (nullable = true)
+----+----+----+------+
|con |info|name|userId|
+----+----+----+------+
|null|null|vv |6 |
|null|null|aa |1 |
|null|i6 |null|66 |
|null|i1 |null|11 |
|888 |null|null|12 |
|123 |null|null|17 |
+----+----+----+------+
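Note that the JSON round trip re-infers every column type, which is why con comes back as long. If you are on Spark 3.1 or later, a union by column name avoids that. The following is only a sketch under that version assumption: the object name UnionOrcByName and the helper unionAllowingMissing are made up for illustration, and the paths are just the ones from my example above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionOrcByName {
  // Union DataFrames by column name; a column missing on one side is filled with null.
  // Assumption: Spark 3.1+, where unionByName accepts allowMissingColumns.
  def unionAllowingMissing(frames: Seq[DataFrame]): DataFrame =
    frames.reduce((left, right) => left.unionByName(right, allowMissingColumns = true))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-orc-by-name")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Example paths; adjust to your own layout.
    val paths = Seq(
      "/home/prasadkhode/data/user1/",
      "/home/prasadkhode/data/user2/",
      "/home/prasadkhode/data/user3/"
    )

    // Read each directory on its own, then merge the frames by column name.
    val result = unionAllowingMissing(paths.map(p => spark.read.orc(p)))
    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```

This keeps the column types stored in the ORC files instead of whatever the JSON reader infers, at the cost of listing the directories explicitly.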
It looks like mergeSchema is not supported for ORC data; there is an open ticket for it in the Spark Jira.
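For later readers: as far as I know, newer Spark releases (3.0 and later) do accept the mergeSchema option for ORC, so the one-liner from the question may simply work after an upgrade. A minimal sketch, assuming such a version; the object name ReadOrcMergeSchema and the helper readMerged are hypothetical, and the path is the one from the question. Please verify against the documentation for your Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadOrcMergeSchema {
  // Read every ORC file under `path` and ask Spark to merge their schemas.
  // Assumption: a Spark version whose ORC reader honours the mergeSchema option.
  def readMerged(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("mergeSchema", "true")
      .orc(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-orc-merge-schema")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Parent directory holding user1/, user2/ and user3/, as in the question.
    val df = readMerged(spark, "file:///home/hadoop/data/")
    df.printSchema()
    df.show(false)

    spark.stop()
  }
}
```

If the merged columns still do not show up, the per-directory read followed by a union remains the fallback.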
Upvotes: 0