Ishan Sanganeria

Reputation: 355

Spark read from Redshift very slow

I have a requirement in which I need to read data from AWS Redshift and write the result as CSV to an AWS S3 bucket using Apache Spark on an EC2 instance.

I am using the io.github.spark_redshift_community.spark.redshift driver to read the data from Redshift with a query. This driver executes the query and stores the result in a temporary location in S3 in CSV format.

I do not want to use Athena or the UNLOAD command due to certain constraints.

I am able to achieve this, but the read from the S3 temp_directory is very slow.

(Screenshot: Spark logs showing the timings of the read from the S3 temp_directory)

As you can see above, it takes almost a minute to read 10k records (about 2 MB) from the S3 temp_directory and then write them to the target S3 location.

Based on the logs, I can tell that storing the Redshift data into the S3 temp_directory is fairly quick. The delay happens while reading from this temp_directory.

The EC2 instance on which spark is running has IAM role access to the S3 bucket.

Below is the code that reads from Redshift:

    spark.read()
        .format("io.github.spark_redshift_community.spark.redshift")
        .option("url", URL)
        .option("query", QUERY)
        .option("user", USER_ID)
        .option("password", PASSWORD)
        .option("tempdir", TEMP_DIR)
        .option("forward_spark_s3_credentials", "true")
        .load();
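For context, the write side is a plain CSV write back to S3. This is only an illustrative sketch of that step, not my exact code; `df` stands for the DataFrame returned by the read above, and `OUTPUT_PATH` is a placeholder for the s3a:// destination:

```java
// Sketch of the S3 write step described above (names are placeholders).
// coalesce(1) is optional: it produces a single CSV part file at the cost
// of funnelling all data through one task.
df.coalesce(1)
    .write()
    .mode("overwrite")
    .option("header", "true")
    .csv(OUTPUT_PATH);
```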

Below are the pom.xml dependencies:

    <dependencies>

    <dependency>
        <groupId>com.eclipsesource.minimal-json</groupId>
        <artifactId>minimal-json</artifactId>
        <version>0.9.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.ini4j</groupId>
        <artifactId>ini4j</artifactId>
        <version>0.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.2</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.26</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>3.3.1</version>
    </dependency>

    <dependency>
        <groupId>io.github.spark-redshift-community</groupId>
        <artifactId>spark-redshift_2.12</artifactId>
        <version>4.2.0</version>
    </dependency>

    <dependency>
        <groupId>io.delta</groupId>
        <artifactId>delta-core_2.12</artifactId>
        <version>2.2.0</version>
    </dependency>

    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.15</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-bundle</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hadoop-cloud_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>


    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Upvotes: 0

Views: 399

Answers (1)

Ishan Sanganeria

Reputation: 355

I found the solution to this issue.

It turns out that version 4.2.0 of the io.github.spark_redshift_community.spark.redshift driver I was using was causing the issue.

When I switched to the most recent version, 5.1.0, the issue was resolved and the same job completed within 10 seconds.
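Concretely, the fix was just bumping the connector version in pom.xml (the Scala 2.12 build shown here matches the other `_2.12` dependencies in my pom):

```xml
<dependency>
    <groupId>io.github.spark-redshift-community</groupId>
    <artifactId>spark-redshift_2.12</artifactId>
    <version>5.1.0</version>
</dependency>
```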

Thanks!

Upvotes: 1
