Ishan Sanganeria

Reputation: 355

Spark read from Redshift very slow

I have a requirement in which I need to read data from AWS Redshift and write the result as CSV to an AWS S3 bucket using Apache Spark on an EC2 instance.

I am using the io.github.spark_redshift_community.spark.redshift driver to read the data from Redshift with a query. This driver executes the query and stores the result in a temporary location in S3 in CSV format.

I do not want to use Athena or the UNLOAD command due to certain constraints.

I am able to achieve this, but the read from the S3 temp_directory is very slow.

(Screenshot: Spark logs showing the timings of the read from the S3 temp_directory)

As you can see above, it takes almost a minute to read 10k records (about 2 MB) from the S3 temp_directory and then write them to the target S3 location.

Based on the logs, I can tell that storing the Redshift data into the S3 temp_directory is fairly quick. The delay happens while reading from this temp_directory.

The EC2 instance on which spark is running has IAM role access to the S3 bucket.

Below is the code that reads from Redshift:

    spark.read()
        .format("io.github.spark_redshift_community.spark.redshift")
        .option("url", URL)
        .option("query", QUERY)
        .option("user", USER_ID)
        .option("password", PASSWORD)
        .option("tempdir", TEMP_DIR)
        .option("forward_spark_s3_credentials", "true")
        .load();
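For context, the write side is a plain CSV write back to S3. This is only an illustrative sketch of that step, not my exact code; `df` stands for the DataFrame returned by the read above, and `OUTPUT_PATH` is a placeholder for the s3a:// destination:

```java
// Sketch of the S3 write step described above (names are placeholders).
// coalesce(1) is optional: it produces a single CSV part file at the cost
// of funnelling all data through one task.
df.coalesce(1)
    .write()
    .mode("overwrite")
    .option("header", "true")
    .csv(OUTPUT_PATH);
```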

Below are the pom.xml dependencies:

    <dependencies>

    <dependency>
        <groupId>com.eclipsesource.minimal-json</groupId>
        <artifactId>minimal-json</artifactId>
        <version>0.9.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.ini4j</groupId>
        <artifactId>ini4j</artifactId>
        <version>0.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.2</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.26</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>3.3.1</version>
    </dependency>

    <dependency>
        <groupId>io.github.spark-redshift-community</groupId>
        <artifactId>spark-redshift_2.12</artifactId>
        <version>4.2.0</version>
    </dependency>

    <dependency>
        <groupId>io.delta</groupId>
        <artifactId>delta-core_2.12</artifactId>
        <version>2.2.0</version>
    </dependency>

    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.15</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-bundle</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hadoop-cloud_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>


    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Upvotes: 0

Views: 399

Answers (1)

Ishan Sanganeria

Reputation: 355

I found the solution to this issue.

It turns out that version 4.2.0 of the io.github.spark_redshift_community.spark.redshift driver I was using was causing the issue.

When I switched to the most recent version, 5.1.0, the issue was resolved and the same job completed within 10 seconds.
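Concretely, the fix was just bumping the connector version in pom.xml (the Scala 2.12 build shown here matches the other `_2.12` dependencies in my pom):

```xml
<dependency>
    <groupId>io.github.spark-redshift-community</groupId>
    <artifactId>spark-redshift_2.12</artifactId>
    <version>5.1.0</version>
</dependency>
```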

Thanks!

Upvotes: 1
