Reputation: 34557
When using PySpark you can set the number of partitions (the minPartitions argument) in the sc.textFile method so that you can read a file from S3 more quickly, as explained here. This works well, but as of Spark 1.3 we can also start using DataFrames.
Is something like this also possible for Spark DataFrames? I am trying to load them from S3 into a Spark cluster (which was created via spark-ec2). Basically I am trying to get this bit of code to run quickly for very large 'data.json' files:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
sqlContext = SQLContext(sc)
df = sqlContext.jsonFile('s3n://bucket/data.json').cache()
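For comparison, this is the minPartitions control that sc.textFile offers (a minimal sketch using the sc defined above; the path and the value 30 are just placeholders):
# The second argument to sc.textFile is minPartitions: a lower bound on how many
# partitions, and hence parallel read tasks, Spark uses for the file.
rdd = sc.textFile('s3n://bucket/data.json', 30)
print(rdd.getNumPartitions())  # at least 30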
Upvotes: 0
Views: 4085
Reputation: 34557
While waiting for the issue to get fixed I've found a workaround that works for now. The .json
file contains one dictionary per row, so what I could do is first read it in as a text-file RDD and then convert it into a DataFrame by specifying the columns manually:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
sqlContext = SQLContext(sc)
# Read the raw lines with 30 partitions (SQLContext has no textFile method,
# so this goes through the SparkContext)
data = sc.textFile('s3n://bucket/data.json', 30).cache()
# Each line holds a dict literal; turn it into a Row with explicit columns
df_rdd = data\
    .map(lambda x: dict(eval(x)))\
    .map(lambda x: Row(x1=x['x1'], x2=x['x2'], x3=x['x3'], x4=x['x4']))
df = sqlContext.inferSchema(df_rdd).cache()
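As a variation on the same idea, each line can be parsed with json.loads instead of eval; this is just a sketch that assumes every row of data.json is a standalone JSON object with the fields x1 to x4:
import json

rows = sc.textFile('s3n://bucket/data.json', 30)\
    .map(json.loads)\
    .map(lambda d: Row(x1=d['x1'], x2=d['x2'], x3=d['x3'], x4=d['x4']))
df = sqlContext.inferSchema(rows).cache()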
Either way, this follows the approach described in the docs. It also means that you could use a .csv
file instead of a JSON file (which usually saves a lot of disk space), as long as you manually specify the column names in Spark.
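A rough sketch of that CSV variant, assuming a headerless file with the same four columns in a fixed order (the path and column names are placeholders):
# Split each CSV line and map the columns onto named Row fields
lines = sc.textFile('s3n://bucket/data.csv', 30)
rows = lines\
    .map(lambda line: line.split(','))\
    .map(lambda cols: Row(x1=cols[0], x2=cols[1], x3=cols[2], x4=cols[3]))
df = sqlContext.inferSchema(rows).cache()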
Upvotes: 0
Reputation: 4648
There's actually a TODO note related to this here, and I created the corresponding issue here, so you can up-vote it if that's something you'd need.
Regards,
Olivier.
Upvotes: 1