cantdutchthis

Reputation: 34557

Spark Dataframe parallel read

When using pyspark you can set the minimum number of partitions in the sc.textFile method so that you can read a file from S3 more quickly, as explained here. This works well, but as of Spark 1.3 we can also start using DataFrames.
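
For reference, the RDD approach I mean is roughly this (the bucket path and the partition count of 30 are just placeholders):

from pyspark import SparkContext
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
# the second argument to textFile is minPartitions, which splits the S3 read across more tasks
rdd = sc.textFile('s3n://bucket/data.json', 30)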

Is something like this also possible for Spark DataFrames? I am trying to load them from S3 onto a Spark cluster (which was created via spark-ec2). Basically I am trying to get this bit of code to run quickly for very large 'data.json' files:

from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
sqlContext = SQLContext(sc)
df = sqlContext.jsonFile('s3n://bucket/data.json').cache()

Upvotes: 0

Views: 4085

Answers (2)

cantdutchthis

Reputation: 34557

While waiting for the issue to get fixed I've found a workaround that works for now. The .json file contains a dictionary on each row, so what I can do is first read it in as an RDD text file and then cast it to a DataFrame by specifying the columns manually:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
sqlContext = SQLContext(sc)
# read the raw json lines as an RDD split over 30 partitions
data = sc.textFile('s3n://bucket/data.json', 30).cache()
# each line is a dictionary literal; turn it into a Row with explicit columns
df_rdd = data\
    .map(lambda x: dict(eval(x)))\
    .map(lambda x: Row(x1=x['x1'], x2=x['x2'], x3=x['x3'], x4=x['x4']))
df = sqlContext.inferSchema(df_rdd).cache()

As per the docs. This also means that you could use a .csv file instead of a json file (which usually saves a lot of disk space), as long as you manually specify the column names in Spark.
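
For example, a minimal sketch of the .csv variant could look like this (the comma delimiter and the column names x1..x4 are just assumptions about the file's layout):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')
sqlContext = SQLContext(sc)
# read the raw csv lines over 30 partitions and name the columns by hand
lines = sc.textFile('s3n://bucket/data.csv', 30).cache()
rows = lines\
    .map(lambda x: x.split(','))\
    .map(lambda p: Row(x1=p[0], x2=p[1], x3=p[2], x4=p[3]))
df = sqlContext.inferSchema(rows).cache()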

Upvotes: 0

Olivier Girardot

Reputation: 4648

There's actually a TODO note related to this here, and I created the corresponding issue here, so you can up-vote it if that's something you'd need.

Regards,

Olivier.

Upvotes: 1
