krackoder

Reputation: 2981

Parallelizing a for loop with map and reduce in Spark with PySpark

In my application, I am creating separate data frames from data in different locations on S3 and then trying to merge them into a single dataframe. Right now I am using a for loop for this, but I have a feeling this could be done much more efficiently using the map and reduce functions in PySpark. Here's my code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, GroupedData
import pandas as pd
from datetime import datetime


sparkConf = SparkConf().setAppName('myTestApp')
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)

filepath = 's3n://my-s3-bucket/report_date='

date_from = pd.to_datetime('2016-08-01',format='%Y-%m-%d')
date_to = pd.to_datetime('2016-08-22',format='%Y-%m-%d')
datelist = pd.date_range(date_from, date_to)

First = True

#THIS is the for-loop I want to get rid of
for dt in datelist:
    date_string = datetime.strftime(dt, '%Y-%m-%d')
    print('Running the pyspark - Data read for the date - '+date_string)
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header = "false", inferschema = "true", delimiter = "\t")
          .load(filepath + date_string + '/*.gz'))

    if First:
        First=False
        df_Full = df
    else:
        df_Full = df_Full.unionAll(df)

Upvotes: 4

Views: 4618

Answers (1)

zero323

Reputation: 330063

Actually, the iterative union, although suboptimal, is not the biggest issue here. A much more serious problem is introduced by the schema inference (inferschema = "true").

It not only makes the DataFrame creation non-lazy, but also requires a separate data scan just for the inference. If you know the schema up front, you should provide it as an argument to the DataFrameReader:

schema = ...

df = sqlContext.read.format("com.databricks.spark.csv").schema(schema)
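
For concreteness, here is a minimal sketch of a complete read with an explicit schema. The column names and types are purely hypothetical; replace them with the actual layout of your TSV files:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical columns -- adjust to the real layout of the files
schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("event", StringType(), True),
    StructField("value", DoubleType(), True),
])

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .schema(schema)
      .options(header="false", delimiter="\t")
      .load(filepath + date_string + '/*.gz'))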

Otherwise you can extract it from the first DataFrame. Combined with well-tuned parallelism, this should work just fine, but if the number of files you fetch is large, you should also consider a slightly smarter approach than an iterative union. You'll find an example in my answer to Spark union of multiple RDDs. It is more expensive but has better general properties.
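
As a sketch of what the whole job could look like once the schema is fixed (reusing the hypothetical schema above), here is the original loop expressed as a reduce, plus a variant that unions the underlying RDDs in a single step, in the spirit of the linked answer:

from functools import reduce

# One DataFrame per date, read with the explicit schema (no inference pass)
dfs = [
    sqlContext.read.format("com.databricks.spark.csv")
        .schema(schema)
        .options(header="false", delimiter="\t")
        .load(filepath + datetime.strftime(dt, '%Y-%m-%d') + '/*.gz')
    for dt in datelist
]

# The original loop expressed as a reduce -- still an iterative union
df_full = reduce(lambda a, b: a.unionAll(b), dfs)

# Or union the underlying RDDs in one step and rebuild the DataFrame,
# which keeps the lineage flat
df_full = sqlContext.createDataFrame(sc.union([df.rdd for df in dfs]), schema)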

Regarding your idea: it is not possible to nest operations on distributed data structures, so if you want to read the data inside a map you'll have to use an S3 client directly, without going through the SQLContext.
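
A rough sketch of that approach, assuming boto3 is available on the executors and that keys is a list of S3 object keys you have enumerated yourself (both of which are assumptions, not part of your original setup):

import io
import gzip
import boto3  # assumed to be installed on the driver and every executor

def read_partition(keys_iter):
    # Create the client inside the partition; boto3 clients are not picklable
    s3 = boto3.client('s3')
    for key in keys_iter:
        body = s3.get_object(Bucket='my-s3-bucket', Key=key)['Body'].read()
        for line in gzip.GzipFile(fileobj=io.BytesIO(body)):
            # Parse to match the hypothetical schema above
            user_id, event, value = line.decode('utf-8').rstrip('\n').split('\t')
            yield (user_id, event, float(value))

# 'keys' is assumed to be a list of S3 object keys you have listed yourself,
# e.g. everything under each report_date= prefix
rows = sc.parallelize(keys, numSlices=len(keys)).mapPartitions(read_partition)
df_full = sqlContext.createDataFrame(rows, schema)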

Upvotes: 4
