Reputation: 464
I have a list of JSON files which I would like to load in parallel.
I can't use read.json("*") because the files are not in the same folder and there is no specific pattern I can implement.
I've tried sc.parallelize(fileList).select(hiveContext.read.json), but the Hive context, as expected, doesn't exist on the executors.
Any ideas?
Upvotes: 8
Views: 12309
Reputation: 3029
The function json(paths: String*) takes variable arguments (see the documentation),
so you can change your code like this:
hiveContext.read.json(file1, file2, ...)
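If the paths are already in a collection, like the fileList from the question, a minimal Scala sketch (the example paths are hypothetical) could expand them into the varargs call:

// Hypothetical list of JSON files scattered across different folders
val fileList = Seq("/data/a/1.json", "/tmp/b/2.json", "/user/x/3.json")

// json(paths: String*) is varargs, so expand the Seq with `: _*`
val df = hiveContext.read.json(fileList: _*)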
Upvotes: 2
Reputation: 109
A solution for PySpark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

# Read the files as plain text (paths given as a comma-separated list),
# then parse the resulting RDD of JSON strings into a DataFrame
text = sc.textFile("file1,file2...")
df = spark.read.json(text)
Upvotes: 1
Reputation: 2967
Also, you can specify a directory as a parameter:
cat 1.json
{"x": 1.0, "y": 2.0}
{"x": 1.5, "y": 1.0}
sudo -u hdfs hdfs dfs -put 1.json /tmp/test
cat 2.json
{"x": 3.0, "y": 4.0}
{"x": 1.8, "y": 7.0}
sudo -u hdfs hdfs dfs -put 2.json /tmp/test
sqlContext.read.json("/tmp/test").show()
+---+---+
| x| y|
+---+---+
|1.0|2.0|
|1.5|1.0|
|3.0|4.0|
|1.8|7.0|
+---+---+
Upvotes: 2
Reputation: 464
Looks like I found the solution:
val text = sc.textFile("file1,file2....")
val df = sqlContext.read.json(text)
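Since the question starts from a list of paths, a minimal sketch of the same approach (assuming the paths are held in a hypothetical fileList: Seq[String]) could build the comma-separated argument for textFile:

// Hypothetical list of JSON files living in different folders
val fileList = Seq("/data/a/1.json", "/tmp/b/2.json", "/user/x/3.json")

// textFile accepts a comma-separated list of paths; the files are read
// in parallel and each line becomes one element of the RDD
val text = sc.textFile(fileList.mkString(","))

// Parse the RDD of JSON strings into a DataFrame
val df = sqlContext.read.json(text)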
Upvotes: 5