I have a requirement where I need to read data from AWS Redshift and write the result as CSV to an AWS S3 bucket using Apache Spark running on an EC2 instance.
I am using the io.github.spark_redshift_community.spark.redshift
driver to read the data from Redshift using a query. This driver executes the query and stores the result in a temporary location in S3 in CSV format.
I do not want to use Athena or the UNLOAD
command due to certain constraints.
I am able to achieve this, but the read from the S3 temp_directory
is very slow.
It takes almost a minute to read 10k records (about 2 MB) from the S3 temp_directory
and write them to the target S3 location.
Based on the logs, storing the Redshift data into the S3 temp_directory
is fairly quick; the delay happens while reading from this temp_directory.
The EC2 instance on which Spark is running has IAM role access to the S3 bucket.
Below is the code that reads from Redshift:
spark.read()
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", URL)
    .option("query", QUERY)
    .option("user", USER_ID)
    .option("password", PASSWORD)
    .option("tempdir", TEMP_DIR)
    .option("forward_spark_s3_credentials", "true")
    .load();
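For completeness, the write side looks roughly like this. It is a minimal sketch, assuming the Dataset&lt;Row&gt; returned by the read above is assigned to df and that OUTPUT_PATH is a hypothetical s3a:// target path:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// df is the Dataset<Row> returned by the read above (assumption);
// OUTPUT_PATH is a hypothetical target such as "s3a://my-bucket/output/".
df.coalesce(1)                      // optional: one CSV part file is fine for ~2 MB
  .write()
  .mode(SaveMode.Overwrite)
  .option("header", "true")
  .csv(OUTPUT_PATH);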
Below are the pom.xml dependencies:
<dependencies>
    <dependency>
        <groupId>com.eclipsesource.minimal-json</groupId>
        <artifactId>minimal-json</artifactId>
        <version>0.9.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.ini4j</groupId>
        <artifactId>ini4j</artifactId>
        <version>0.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.2</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.26</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>3.3.1</version>
    </dependency>
    <dependency>
        <groupId>io.github.spark-redshift-community</groupId>
        <artifactId>spark-redshift_2.12</artifactId>
        <version>4.2.0</version>
    </dependency>
    <dependency>
        <groupId>io.delta</groupId>
        <artifactId>delta-core_2.12</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.15</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-bundle</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hadoop-cloud_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>
I found the solution to this issue.
It turns out that version 4.2.0
of the io.github.spark_redshift_community.spark.redshift
driver I was using was causing the slowness.
When I switched to the most recent version, 5.1.0,
the issue was resolved and the same job completed within 10 seconds.
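For reference, the fix was just a version bump of the spark-redshift dependency already listed in the pom.xml above:

<dependency>
    <groupId>io.github.spark-redshift-community</groupId>
    <artifactId>spark-redshift_2.12</artifactId>
    <version>5.1.0</version>
</dependency>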
Thanks!